Creating a Spoken Corpus for the Tibetan Language

“When a language dies, a way of understanding the world dies with it.” ― George Steiner

Why is the Spoken Corpus Necessary?

  • Beginning readers lack a consistent, concrete path to Tibetan language literacy due to a gap between literary and colloquial Tibetan (linguists call this situation a “diglossia”—where “two languages,” which are closely related, are used by a single language community—in this case, Spoken Tibetan and Literary Tibetan)
  • This gap should be addressed by the creation of graded literature for the Tibetan language. The first step is:
    • Researching Tibetan language learning and literacy in order to create headword lists—these will form the basis of graded reading material and lessons


Beginning readers of all languages often have difficulty learning to read, and Tibetan is no exception. There are, however, specific strategies that have been tried and tested in other languages that could help support Tibetan language literacy. As in the graphic above, most young readers will best reach the peak of literary sophistication if they are progressively eased and guided to the top.

Creating this pathway begins with understanding the linguistic terrain of the mountainside. Research into the English language, for instance, has shown that a mere one hundred words account for approximately half of all the words used in speech and writing, and that learning these words first helps beginning readers learn how to read. This sort of research (along with basic phonics) has been instrumental in revolutionizing the way English-speaking children have learned to read over the last fifty years or so.


If this kind of list is expanded to the range of the 2,000 most common words (the so-called “General Service List,” also derived from a corpus by using frequency analysis), it covers 90-95% of speech and 80-85% of common written texts. Education specialists who develop materials like Early Readers, which are specifically designed to help children begin to read, make use of frequency lists such as these.
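The coverage figures above come from frequency analysis of a corpus. As a rough illustration only, the following Python sketch computes how much of a transcript is covered by its N most frequent words. It assumes the transcript has already been split into words (a non-trivial step for Tibetan, which does not separate words with spaces); the function name and the toy English data are hypothetical and not part of any existing tool.

```python
from collections import Counter


def coverage_of_top_n(tokens, n):
    """Return the fraction of all tokens accounted for by the n most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Hypothetical toy transcript, already split into words (assumption).
toy_tokens = "the cat and the dog and the bird saw the cat".split()

# Share of the transcript covered by its two most frequent words.
print(coverage_of_top_n(toy_tokens, 2))
```

Run over a real spoken corpus with n set to 100 or 2,000, the same calculation would show how quickly coverage rises for Tibetan, which is the kind of evidence a headword list would be built on.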

This is because beginning readers are “learning to read” and not “reading to learn.” That is, they are learning how speech sounds (phonemes) are represented by written letters, or groups of written letters (graphemes). Learning to read is thus a sequential process that is based on the spoken language a child already knows. Research into vocabulary acquisition also shows that a child learns best when he or she understands 98% of the words within a text.


The first step in supporting beginning readers of the Tibetan language, then, is the research that would help us to understand the spoken language levels of Tibetans. By knowing exactly what words people use in their everyday language, reading material that directly relates to them can be created. This will be an invaluable asset in supporting literacy and learning in the Tibetan language for both L1 and L2 learners (first- and second-language learners).

It will not only help L1 students to learn to read and write Tibetan well: studies also show that mother-tongue literacy is a very important stepping-stone to literacy in second languages. Children who are better able to read and write in Tibetan, then, will also be building skills that will help them to be better able to read and write in English (or any other language). This research must begin with a spoken corpus.

(i) Contextual Considerations: The History of the Tibetan Diglossia

Key Points:

  • Literary Tibetan represents a version of spoken Tibetan—it’s just that this version is 1,000 years old
  • Historical examples of Tibetan language standardization represent an effort to make the literary language accessible—modern efforts ought to have the same aim
  • The Tibetan language situation may be described as a “diglossia”: the “low” spoken language is superposed by the “high” literary language
    • Examples of other diglossias, such as Arabic, suggest this situation is an obstacle to literacy: this insight is applicable to the Tibetan language
  • A diglossia is an obstacle because beginning readers are “learning to read,” not “reading to learn;” unless beginning readers have access to level-appropriate reading material that reflects their spoken level early on, they may never be able to access higher level texts later
  • Research will allow us to understand these issues, specifically as they relate to the Tibetan language, and to then respond by allocating resources and developing materials and teaching methodologies more effectively

When, according to tradition, Thönmi Sambhoṭa formed the Tibetan orthography at the behest of King Songtsän Gampo, its standardized form reflected the central dialect of Kyishö (skyid shod), the region just south of Lhasa. It is here that the first translations were commissioned, and later, where texts were edited to reflect this official standard (though other dialects existed, they weren’t recorded).

Transcriptions of Old Tibetan into other languages confirm that Tibetan spelling more or less reflected pronunciation at this point in history (and some dialects are still closer than others in this regard, such as Ladakhi, Balti, and the Tibetan spoken in rgyal rong). Later reforms of the language also show a sensitivity to modifying the written word to reflect its spoken form.

There were three of these language reforms, all of which followed three main guidelines, to make sure that: 1) everything agreed with the grammar treatises for syntax and spelling rules; 2) the meaning of the translation corresponded to the meaning of the original Buddhist texts; and 3) the language used within was easy to understand for Tibetan readers.

It is this third guideline that is most important: vocabulary was carefully selected from the regional vernacular, by order of the King, to reflect the meaning of the Buddhist texts; it was considered vitally important that the text be easily comprehended by Tibetan readers. One thousand years later, it seems time to reflect on whether or not this remains the case, and whether the textual tradition has remained true to the intent of its founders.

After all this time, the literary language has changed very little, while the spoken dialects of Tibetan have continued to evolve: this is the primary reason that there is a gap between the written and spoken versions of the Tibetan language. This gap, called a “diglossia” by linguists, makes literary achievement a more difficult task than when there is a close relationship between the written and spoken languages.

The reason children have more difficulty learning to read in a diglossia is that they cannot directly relate to the text: they are not familiar with the vocabulary, grammar, or syntax. This issue has been well-documented in the Arabic language, where education specialists note that since the literary language is disconnected from everyday reality, it’s become important to approach teaching reading and writing in new ways.

In general, “diglossias” (like Tibetan and Arabic and many others) occur where:

  • a) there is a large body of culturally defining literature (in this case, the Buddhist canon)
  • b) there are low literacy rates (an issue in Tibet)
  • c) the literature has been around for centuries (for Tibetan, around 1,000 years)

(ii) Modern Methods for Teaching Reading & Writing

Teachers in many different languages have come to the realization that a beginning student’s spoken language level is an important consideration in determining appropriate reading material. More recently, educators developing graded readers have begun using headword count as a strong indicator of reading level. Again, the importance of beginning readers reading to reinforce and enrich vocabulary that they already know cannot be overstated; beginning readers work by making associations between the spoken words that they know and their printed form.

Therefore, early reading texts should have almost the same vocabulary that a child does. Beginning readers are also developing a skill called “automaticity” (the ability to recognize words without having to sound them out each time), and this takes seeing these words many times over. This automaticity then builds reading speed, an important factor in reading comprehension. Meanwhile, new vocabulary can only be obtained by seeing it in a known context: ideally, it’s said that new vocabulary should make up only 2% of the text.
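The known-word threshold described above can be checked mechanically once a learner's vocabulary list exists. The sketch below is one possible way to do this in Python, not an established method: the function names, the 0.98 default, and the toy word lists are illustrative assumptions, and the text is assumed to be already split into words.

```python
def known_word_ratio(text_tokens, known_words):
    """Return the fraction of running words in the text that the reader already knows."""
    if not text_tokens:
        return 0.0
    known = sum(1 for token in text_tokens if token in known_words)
    return known / len(text_tokens)


def unknown_words(text_tokens, known_words):
    """Return the distinct words in the text that are not in the known vocabulary."""
    return sorted(set(text_tokens) - set(known_words))


def is_level_appropriate(text_tokens, known_words, threshold=0.98):
    """True if at least `threshold` of the running words are known (assumed cutoff)."""
    return known_word_ratio(text_tokens, known_words) >= threshold


# Hypothetical vocabulary and texts, for illustration only.
known = {"the", "dog", "runs"}
easy_text = ["the", "dog", "runs"] * 33 + ["yak"]  # 100 words, 1 unknown
hard_text = ["the", "yak", "runs"]                 # 3 words, 1 unknown

print(is_level_appropriate(easy_text, known))
print(is_level_appropriate(hard_text, known))
print(unknown_words(hard_text, known))
```

A writer of graded readers could use a check like this to flag which words in a draft fall outside the target level, then decide whether to replace them or keep them as the small share of new vocabulary.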

Since this is the case, the question becomes, how do we identify Tibetan children’s language levels so that we can create reading material that best supports their learning? That is, if learning to read and write the Tibetan language requires educational material that reflects natural language levels, how do we know what those levels are? Additionally, what themes or stories interest them? The answer is, by creating a spoken corpus of Tibetan speech.

By recording and collecting the way in which Tibetan children naturally use their own language, we will better understand what language they use: what words they know (vocabulary) and the structures they use (grammar). By knowing what stories they tell, and in what ways they tell them, we’ll know what content interests children. Understanding level-appropriate language and content is the first step in developing educational material that directly targets children who are making their first steps on the pathway to sophisticated literacy.

(a) Introducing the Spoken Corpus

A language corpus is a linguistic database that is primarily used to analyze the frequency of various vocabulary and grammar structures and the many different connections between terms or sets of terms found within a language. Corpuses are thus useful to both linguists and educators: linguists use corpuses to study and describe how a language works (by studying the lexical, grammatical, phonological, or morphological patterns of the language), whereas educators are able to use their data as a reference guide for creating properly structured, graded lessons and materials based on authentic examples of the language as it’s used by native speakers. Our proposal focuses on this second use: a Spoken Corpus for Tibetan will be useful for developing educational reading material targeted toward Tibetan children and other learners.
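To make the idea of “connections between terms” concrete, one of the simplest corpus queries counts which words occur next to each other. The Python sketch below is a minimal illustration under stated assumptions: each utterance is a list of words that has already been segmented, and the function name and toy English utterances are hypothetical.

```python
from collections import Counter


def bigram_counts(utterances):
    """Count adjacent word pairs within each utterance (never across utterances)."""
    counts = Counter()
    for tokens in utterances:
        # Pair each word with the word that follows it in the same utterance.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical toy utterances, already split into words (assumption).
utterances = [
    ["I", "drink", "tea"],
    ["you", "drink", "tea"],
    ["I", "drink", "water"],
]

pairs = bigram_counts(utterances)
print(pairs[("drink", "tea")])
```

Counts like these show which combinations children actually produce, which can guide the choice of example sentences in graded material.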

(b) Broader Reasons for the Spoken Corpus

This research is particularly important for the Tibetan language since such a reference database does not yet exist for the language. Thus, the Spoken Corpus will benefit both Tibetan language researchers and, more immediately, Tibetan language educators, in a variety of language teaching environments. Moreover, the Tibetan language community is under linguistic pressure from many other languages: Chinese in Tibet itself, Hindi and English in India, and various other languages wherever Tibetans have migrated to.

Supporting reading and writing skills in a modern, standard Tibetan will also promote using the Tibetan language in modern mediums like SMS (texts), blogs (internet), emails, and more, where many Tibetans use these second languages in preference to their own mother tongue.

In other words, the Spoken Corpus will directly impact Tibetan’s ability to respond to what UNESCO deems to be the main factors of language endangerment. These are namely: (a) the documentation of the language, (b) the language’s response to new domains and media, and (c) the availability of materials for language education and literacy.

In India, it’s notable that Tibetan language coursebooks used in native classroom education remain ungraded, and subsequent generations of Indian-born Tibetans do not have the full language skills of their peers who are from Tibet itself:

*In a preliminary study of native Tibetan speakers who have grown up in Tibet versus the diaspora, participants were asked to explain vocabulary from random pages in the dictionary. While Tibetans from Tibet knew, on average, 70% of the vocabulary, speakers of the diaspora knew less than 50% (in other words, speakers from Tibet proper have 50% more vocabulary than speakers from the diaspora).

A textbook, with accompanying teacher’s manual, to teach Tibetan in Tibetan, would be an invaluable resource for teachers of the language who may be teaching Tibetan children in the diaspora (the subject of this proposal), monastics in their institutions, Tibetan students of standard Tibetan (those Tibetans of various native dialects who wish to study the standard spoken language) or foreign students of the language.

We would also like to briefly address concerns people may have about writing and recording certain aspects of the spoken language, a necessity when creating a spoken corpus. The view of the གངས་ལྗོངས་མཁས་པའི་ལྷན་ཚོགས་, for instance, may be paraphrased by Khenpo Tsultrim Lodro:

“Nowadays, Amdo, Central, and Khampa vernaculars are diverging further and further: If we think of olden times, the language found in our religious and specialized literature was extremely consistent. We should be aware that this literary language is actually the standardized language of the Tibetan ethnicity. Apart from pronunciation differences, the literary language standardizes our language perfectly.”

Some may think that a simplified children’s literature, based on a spoken language corpus, would encourage a further divergence from the literary standard by using some aspects of colloquial language.

However, basing a children’s literature on a spoken corpus has exactly the opposite aim of language divergence: an extremely consistent, standardized language that would be representative of the Tibetan ethnicity can only continue if low-level literacy is addressed first. After all, children who read children’s literature will become adults who read adult literature! Although a modern, standardized children’s literature may, by necessity, reflect some very basic aspects of the spoken language (such as verbal auxiliaries), it needn’t incorporate “vernacular” aspects of language: regional lexicons, slang, idioms, or accent.

Indeed, Khenpo Tsultrim Lodro himself has begun to pave the way toward a modern, standardized Tibetan by publishing his dictionary of new daily vocabulary: the རྒྱ་བོད་དབྱིན་གསུམ་གསར་བྱུང་རྒྱུན་བཀོལ་མིང་མཛོད།. We believe that resources such as this, combined with research on the language use of native speakers, will be invaluable in reaching a consensus for terminology use within modern Tibetan literature. Since readers who read extensively also draw from their reading vocabulary while speaking, including vocabulary aims in extensive reading materials will also have a positive effect on the living language and help to build Tibetan speakers’ vocabularies.

In summary:

  • Language sophistication is only built over time
    • The foundation of this sophistication is unsophisticated, level-appropriate children’s literature
    • Since “learning to read” precedes “reading to learn,” it is especially important to focus on Tibetan literacy for children
  • A strong foundation in Tibetan language skills will support:
    • Foreign language skills (such as English)
    • Building vocabulary (improving Tibetan speaking skills)

Addressing these issues will lead to more equitable reading conditions for all Tibetan peoples: providing an unsophisticated literature for more readers and writers in the short term will lead to more sophisticated literature for more readers and writers in the long run.