We are pleased to announce that Esukhia has two articles featured in the latest edition of the University of California’s peer-reviewed academic journal Himalayan Linguistics (Volume 15, Issue 1). This was a special issue dedicated to Natural Language Processing (NLP) for the Tibetan language: in other words, research in Tibetan that uses or creates digital resources. (For more general information on NLP in Tibetan, read Nathan Hill’s introduction to the issue, found here.)
We’ve re-published the article abstracts here for your convenience. Please follow the links for the full articles! The first describes a word segmentation tool as well as a method for grading reading material (while providing researchers with information on Tibetan syntax). The second focuses on educational applications for Tibetan NLP.
Towards describing Tibetan syntax: From word segmentation to rewrite rules through a semi-automated workflow
Hildt, Hélios (Université Bordeaux Montaigne; Esukhia France, President & Co-founder)
The first task in Tibetan Natural Language Processing is word segmentation. We present our lightweight segmentation tool that is based on lexical resources. It can be executed natively in InDesign, and the user can update it with the manual corrections of its output. We then propose a semi-automated workflow aiming at syntactic analysis that uses utterance simplification and intonation cues to get precise information about the syntactical structure. Non-specialized native speakers are thus able to provide us with precise information about the structure of utterances. This will allow the scientific community to obtain the resources needed to initiate the study of Tibetan syntax. In this process, informants will obtain educational material generated from the utterances they will have processed.
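To give a rough idea of what lexicon-based segmentation can look like, here is a minimal Python sketch of greedy longest-match segmentation over syllables. It is not the tool described in the article (which runs inside InDesign); the function name `segment`, the `max_len` parameter, and the two-word toy lexicon are assumptions made purely for illustration. It also assumes that syllables are separated by the tsheg (U+0F0B) and ignores other punctuation.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter


def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation (illustrative sketch only).

    text    -- a string of syllables separated by tsheg
    lexicon -- a set of known words, each written with tsheg between syllables
    max_len -- longest word, in syllables, to try (an assumed limit)
    """
    syllables = [s for s in text.split(TSHEG) if s]
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = TSHEG.join(syllables[i:i + n])
            # A single unknown syllable is emitted as its own word.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon with one entry; real lexical resources are far larger.
word_a = "བཀྲ" + TSHEG + "ཤིས"
word_b = "བདེ" + TSHEG + "ལེགས"
lexicon = {word_a}
text = word_a + TSHEG + word_b

print(segment(text, lexicon))  # word_b is split into two syllables

# A manual correction can be fed back in by adding the word to the lexicon.
lexicon.add(word_b)
print(segment(text, lexicon))  # word_b is now kept whole
```

The second call shows the feedback idea mentioned in the abstract: once a corrected word is added to the lexicon, later runs segment it properly.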
Practical Applications for Corpora: The Role of Research-based Linguistics in Literacy & Education for the Tibetan Language
Schmidt, Dirk (Esukhia – Research & Development)
Corpus Linguistics and NLP have many obvious applications for researchers, academics, and other specialists; what should not be overlooked, however, is their role in improving the mundane, everyday interactions between people and language, be they a reader of a newspaper; a child with a storybook; or a student in a classroom. The language analyses that these linguistic tools provide have an important part to play in the feedback loop between authors, journalists, and pedagogists on the one hand and their audiences and students on the other.
While these sorts of research-based resources have already made splashes in majority languages like English, their ripples have yet to spill over into the smaller language markets. Within this paper we outline the ways in which corpus linguistics may inform Tibetan language literacy and education in both L1 & L2 contexts, while drawing from our own research into issues of readability and the development of a modern pedagogy for instruction in the Tibetan alphabet based on frequency data.
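As a small illustration of the kind of frequency data mentioned above, the following Python sketch counts how often each Tibetan base letter occurs in a text and lists the letters from most to least frequent. The paper's actual method is not described here, so this is only one plausible starting point; the function name `letter_frequencies` and the choice to count only code points U+0F40 to U+0F6C (base letters, leaving out subjoined letters and vowel signs) are assumptions.

```python
from collections import Counter

# Assumed range: Tibetan base letters in Unicode (U+0F40 to U+0F6C).
BASE_LETTERS = range(0x0F40, 0x0F6D)


def letter_frequencies(text):
    """Return (letter, count) pairs, most frequent first (illustrative sketch)."""
    counts = Counter(ch for ch in text if ord(ch) in BASE_LETTERS)
    return counts.most_common()


sample = "བཀྲ་ཤིས་བདེ་ལེགས"
for letter, count in letter_frequencies(sample):
    print(letter, count)
```

On a real corpus, a ranking like this could suggest which letters to introduce first to learners.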