On the importance of machine-readable lexicons in the study of South Asian phonologies: Demonstrations from a 16,000-word database of Garo


Charles Redmon and Triksimeda Sangma (2018)
Paper presented at Formal Approaches to South Asian Languages, 8 (Wichita)
Full database and final manuscript in preparation

Recent work on the typology and evolution of phonological systems has revisited earlier formulations in Martinet (1952) and Hockett (1967), emphasizing the role that the lexical distribution of speech sounds plays in the organization and function of phonetic contrast in language (Surendran and Niyogi, 2003; Wedel, 2004; Arbesman et al., 2010; Oh et al., 2015; Dautriche et al., 2017). However, just as standard phonetic, phonological, and psycholinguistic studies rely on a phonemic description, or grammar, of the language, lexical distributional analyses of this sort are also resource-dependent. Specifically, the critical measurements in such work (e.g., phonotactic probability, functional load, neighborhood density) require machine-readable, phonemically transcribed wordlists (referred to in the engineering literature as pronunciation dictionaries) covering a large, representative span of the lexicon of the language in question. Such resources are readily available for English (Balota et al., 2007; Baayen et al., 1993), Dutch (Baayen et al., 1993), German (Baayen et al., 1993), French (New et al., 2004), Spanish (Sebastián-Gallés et al., 2000), and Mandarin (Huang et al., 1997), among others, but are less commonly available for South Asian languages.
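To make concrete why such measurements depend on a transcribed wordlist, the following minimal Python sketch computes one simple entropy-based version of functional load: the proportion of word-level entropy lost when two phonemes are merged. It is an illustration under stated assumptions, not the procedure used in this study. The function names, the toy transcriptions, and the uniform word frequencies are all hypothetical, and the toy forms are not Garo words.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def functional_load(lexicon, x, y):
    """Relative entropy loss when phoneme y is merged into phoneme x.

    lexicon maps a transcription (tuple of phoneme symbols) to a frequency.
    This word-level formulation is a simplifying assumption made for the sketch.
    """
    base = Counter()
    merged = Counter()
    for word, freq in lexicon.items():
        base[word] += freq
        # Rewrite every y as x; words that become identical pool their counts.
        merged[tuple(x if p == y else p for p in word)] += freq
    h = entropy(base)
    return (h - entropy(merged)) / h if h > 0 else 0.0


# Hypothetical toy lexicon with uniform frequencies (not real Garo data).
toy_lexicon = {
    ("s", "a"): 1,
    ("sh", "a"): 1,
    ("k", "a"): 1,
    ("s", "i"): 1,
}

fl_s_sh = functional_load(toy_lexicon, "s", "sh")
fl_s_z = functional_load(toy_lexicon, "s", "z")  # "z" never occurs in the toy data
```

In this toy case, merging "s" and "sh" collapses one minimal pair and removes a quarter of the word-level entropy, whereas merging "s" with the unattested "z" changes nothing. With a real lexicon, the same computation would need the full transcribed wordlist and, ideally, token frequencies.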

On this model, we present a new open-source lexicon of Garo, containing over 16,000 words with hand-checked phonemic transcriptions in a machine-readable flat text format, accompanied by Python build scripts and a free, multi-platform graphical interface for querying the aforementioned measurements and performing lexical searches on phonological patterns. From the Garo data we demonstrate how we are currently using distributional information to understand the Garo sibilant system and make lexically informed predictions to be tested in future production and perception experiments.
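As a rough illustration of the kind of query a flat-text lexicon supports, the sketch below loads a small wordlist, counts phonological neighbors (forms one substitution, insertion, or deletion away), and searches transcriptions with a regular expression. The file layout (one entry per line, a tab between the headword and space-separated phoneme symbols), the function names, and the sample entries are assumptions made for this example only. They do not describe the actual format or interface of the Garo database.

```python
import io
import re

# Hypothetical sample in an assumed format: headword<TAB>space-separated phonemes.
SAMPLE = "w1\tk a\nw2\ts a\nw3\ts a k\nw4\tm i\n"


def load_lexicon(handle):
    """Read tab-separated lines into a dict of headword -> phoneme tuple."""
    lexicon = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        word, phones = line.split("\t")
        lexicon[word] = tuple(phones.split())
    return lexicon


def is_neighbor(a, b):
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(p != q for p, q in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one segment from the longer form must yield the shorter form.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(lexicon):
    """Count, for each headword, the other entries one edit away (O(n^2) sketch)."""
    return {
        w: sum(is_neighbor(t, u) for v, u in lexicon.items() if v != w)
        for w, t in lexicon.items()
    }


def search(lexicon, pattern):
    """Return headwords whose space-joined transcription matches a regex."""
    rx = re.compile(pattern)
    return sorted(w for w, t in lexicon.items() if rx.search(" ".join(t)))


lexicon = load_lexicon(io.StringIO(SAMPLE))
density = neighborhood_density(lexicon)
s_initial = search(lexicon, r"^s ")
```

The pairwise comparison is quadratic in the number of entries, which is tolerable for a sketch but would likely be replaced by an indexed lookup for a lexicon of 16,000 words.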

We also present status reports on the ongoing development of similar databases of Assamese, Khasi, Malayalam, and Telugu, which are currently aimed at addressing longstanding issues in research on phonotactic dependencies in the acoustics of fricative place contrasts, complex consonant clusters, and retroflex consonants, respectively.