LexiVault – SAVANT

LexiVault is a repository and web tool for psycholinguistic lexicons of lesser-studied languages.

The repository is a collection of morphologically parsed Lexical Stimulus Databases for low- or no-resource languages. It is modeled after similar resources that have facilitated research on well-studied languages such as English and Dutch (e.g., the CELEX database, Baayen et al. 1995, and the English Lexicon Project, Balota et al. 2007).

This project is designed to support stimulus creation for psycholinguistic experiments that use single-word paradigms. The data therefore focuses on lexical and sublexical statistics, including morpheme frequency and phonotactic probability.

LexiVault is a work in progress. Currently we are developing lexicons for Bangla, Slovenian, Bosnian-Croatian-Serbian, and different dialects of Arabic. Check out the template page and contact us if you are interested in joining LexiVault and contributing your language lexicons!

About

Investigating psycholinguistic questions relies on fine-grained, corpus-derived measures, yet psycholinguistic research is hampered by a lack of computational resources for most of the world’s languages. Existing resources fall short of the requirements for:

Accessibility
Broad language support for psycholinguistic research
Minimizing resource building efforts

Current goal: Build and collect resources that are structured for psycholinguistic inquiry into lesser-studied languages.

To that end, we set up a baseline process and structure to follow as we build each language resource on our docket, and for eventual contributors to follow as they keep enriching LexiVault. The guidelines are meant to keep the data relevant for psycholinguistic research, namely:

A corpus that is sufficiently representative, in size and content, to support the significance of findings
Specific measures typically used in psycholinguistic studies, such as normalized frequencies, transition probabilities, and grapheme-to-phoneme transcription, which typically do not co-occur with morphological or lexical annotation in the same resource

The format remains flexible enough to accommodate a diverse range of lesser-resourced languages. It covers bare-bones universal features, such as word tokens and their frequencies or IPA representations of phonetic information, and offers the option to extend certain datasets with language-specific attributes, such as roots and patterns for Semitic languages.
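As a rough illustration of two of the measures named above, the sketch below computes per-million normalized frequencies and segment-to-segment transition probabilities from a toy word list. This is not LexiVault's actual pipeline: the function names, the toy words, and the choice to count transitions over word types (rather than weighting by token frequency) are all assumptions made for the example.

```python
from collections import Counter


def per_million(counts):
    """Normalize raw token counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {word: c * 1_000_000 / total for word, c in counts.items()}


def transition_probabilities(transcriptions):
    """Estimate P(next segment | current segment) from adjacent pairs.

    Each transcription is a list of segments (e.g. IPA symbols).
    Counts are type-based: every word contributes once.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for segments in transcriptions:
        for a, b in zip(segments, segments[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical toy lexicon: raw corpus counts and segmented transcriptions.
counts = {"kata": 6, "taka": 3, "kat": 1}
ipa = {
    "kata": ["k", "a", "t", "a"],
    "taka": ["t", "a", "k", "a"],
    "kat": ["k", "a", "t"],
}

freqs = per_million(counts)
probs = transition_probabilities(ipa.values())
```

In a real lexicon the counts would come from the corpus and the segment lists from the grapheme-to-phoneme transcription. Weighting transitions by token frequency instead of word type is an equally reasonable design choice, depending on the research question.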