The Nanhai Corpus is a collection of word-segmented Tibetan totaling about 1.2 million words. The corpus contains three main sections: (1) Natural Speech; (2) Scripted & Prompted Speech (Dialogs and Topics on Buddhism & Monastic Life); and (3) Modern Writing (News [from Amdo, Kham, Lhasa, and Central dialects] and Children’s Literature). A fourth section, representing Middle Tibetan texts (also known as “Classical” Tibetan), is under construction.
We’ll be applying what we learn from this data to future projects, including textbook creation, Tibetan word-editing software, and more. We’ll also be expanding the corpus with new data over the next few years.
You can find the raw corpus data (which is free and open source) here: https://github.com/Esukhia/NanhaiCorpus
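Because the corpus is word-segmented, simple statistics such as token counts can be computed without a separate tokenizer. The following is a minimal sketch only, and it rests on an assumption not stated above: that the segmented files are UTF-8 text with words separated by whitespace. The function name `count_tokens` and the sample lines are hypothetical and are not taken from the repository; check the actual file layout before relying on this.

```python
from collections import Counter
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count tokens, ASSUMING whitespace separates the segmented words."""
    counts: Counter = Counter()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace.
        # The Tibetan tsheg (U+0F0B) is not whitespace, so syllables
        # inside a word stay together.
        counts.update(line.split())
    return counts


# Hypothetical sample lines for illustration, not data from the corpus.
sample = ["ང་ བོད་པ་ ཡིན །", "ང་ སློབ་མ་ ཡིན །"]
counts = count_tokens(sample)

print("total tokens:", sum(counts.values()))
print("distinct tokens:", len(counts))
print("most common:", counts.most_common(3))
```

To run the same count over a downloaded file, pass an open file handle instead of the sample list, for example `count_tokens(open(path, encoding="utf-8"))`, where `path` is wherever you saved the file. Note that under this assumption punctuation marks separated by spaces are counted as tokens too, so totals may differ from the published word count.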