Computer Assisted Proofreading of the Tibetan Buddhist Canon
The Tibetan Buddhist Canon, comprising the Kangyur and the Tengyur, has been edited and published many times by Tibetan masters and Chinese emperors. While all these editions represent monumental effort, it often takes a seasoned philologist or scholar with expert knowledge to make sense of their variant spellings and archaic language. This project, Computer Assisted Proofreading of the Tibetan Buddhist Canon, uses modern technology and advances in Natural Language Processing to create a reader-friendly General Edition of the Buddhist Canon. This edition will be published on paper and in the cloud, gathering and consolidating the best spellings of prior editions without losing any of their respective specificities.
Imagine that recording technology existed during the time of the Buddha, and that all his speeches had been recorded on cassette tapes: an audio reproduction of the Buddha’s speech. Over time, the cassettes were handled by many hands, played again and again on many cassette decks, and duplicated, copied, and recopied many times over. Even though expert audio technicians cleaned them as well as anyone possibly could and remastered the audio using the best technology available to them at the time, time inexorably changed the quality of the recording, distorting and warping the signal.
The various versions of the Tibetan Buddhist Canon are just like these cassettes: they have been proofread and published, then re-proofread and re-published, numerous times in the past, and this process continues even today. The intent of each of these reclamation projects was to improve upon previous versions; however, because the work was done manually in a traditional paper-only format, a formidable and laborious process, each proofreading iteration opened the door to new mistakes even as old ones were corrected, just as re-recording a cassette tape introduces warps, fuzziness, and distortion to the original signal.
Another way to think of it is that they are like a photocopy of a photocopy of a photocopy. With each copy, you may carefully white out errors and write in corrections; yet even though you’ve carefully chased down each of these “flies”, the very act of re-photocopying the photocopy “opens the door” to further distortions of the original.
Today, there are eight “tapes”, or authoritative editions, of the Translated Speech (Kangyur, bka' 'gyur) and four of the Translated Treatises (Tengyur, bstan 'gyur); these have also been combined into a single diplomatic edition (Paydurma, dpe bsdur ma). This edition represents a tremendous improvement, as it provides readers the ability to compare differences between editions; however, we are still faced with the following challenges with the Tibetan Buddhist Canon as it exists today (“flies” we need to chase down):
- Spelling mistakes: spelling mistakes in original block prints are reflected in new editions (even digital input projects have erred on the side of creating a “perfect” reproduction of printed editions; in other words, inputters purposefully recorded block print errors, while making typos of their own), leaving all editions with:
  - Old spelling mistakes from block prints, unresolved, that must be corrected
  - New spelling mistakes introduced with new editions that must be corrected
- Archaic spellings: though there were several official language reforms, the most recent being in the 10th century CE, the problem of old and archaic language has not been comprehensively addressed
  - Archaic language needs to be updated
- Mistakes in the comparative notes must be addressed
- Instances of conflicting and/or contradictory meaning need to be resolved
- Finally, all these improvements need to be realized in a digital format in order to be modern-media friendly; an edition that exists only on paper cannot benefit from the technological advances now available
Why is it important to meet these challenges?
In general, we may say that literature in every language is a mode of communication; the written word is an ancient technology that allows authors to communicate and connect with people across time and space—just as a recording allows us to hear long-ago recorded speech. If modern audiences, including translators, are to be able to connect deeply to the texts of the Buddhist canon, to hear them pristinely, and understand them with all the clarity, nuance, and brilliance with which they were recorded, we need to address issues that will impede the relationship between author and reader, between speaker and hearer.
While we don’t profess to have the human skill and ability of the great scholars and masters of the past, we are confident that modern technology provides us with new and powerful tools to address issues in textual proofreading: new ways to digitally “remaster” old classics. Prior to word processing, publications in all languages were rife with human error in both spelling and grammar (just as tape cassettes were susceptible to fuzziness, distortion, and normal wear-and-tear); today, published works contain far fewer such errors thanks to machine-assisted language tools, like the spell-checkers we use on a daily basis in word processors, on social media, and in chat apps.
Today, audio engineers are able to digitally remaster old recordings just as publishers are able to perfect old manuscripts. In other words, computer-assisted technology allows us to chase down “flies” while leaving the rest of the text untouched. Why can’t we do the same for Tibetan?
First, we take digital copies of the different versions of the canon and run them through a multi-phase human proofreading process in which errors (and potential errors) are flagged. We then use a series of statistical analyses to determine error types and their potential corrections; for example, by comparing Tibetan terminology to its Chinese counterparts and the Sanskrit originals, we may produce automatic corrections/suggestions (along the same lines as Google’s “Did you mean?”).
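To make the idea of machine-generated suggestions more concrete, here is a minimal sketch in Python. It is not the project's actual analysis: it swaps in a simple majority vote across editions for the statistical analyses described above, and it compares editions with each other rather than with Chinese or Sanskrit sources. The function names, the edition labels, and the sample passage (in Wylie transliteration, split on spaces) are all assumptions made for illustration, and the sketch assumes the passages are already aligned syllable for syllable.

```python
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(text, delimiter=TSHEG):
    """Split a passage into syllables, dropping empty pieces."""
    return [s for s in text.split(delimiter) if s]


def flag_variants(editions, delimiter=TSHEG):
    """Return positions where editions disagree, with a majority-vote suggestion.

    `editions` maps a hypothetical edition label to its text of the same passage.
    Assumes the passages are already aligned syllable for syllable.
    """
    split = {name: split_syllables(text, delimiter) for name, text in editions.items()}
    if len({len(s) for s in split.values()}) != 1:
        raise ValueError("passages must be aligned to the same syllable count")

    flags = []
    for position, readings in enumerate(zip(*split.values())):
        counts = Counter(readings)
        if len(counts) > 1:  # at least one edition disagrees here
            suggestion, votes = counts.most_common(1)[0]
            flags.append({
                "position": position,
                "readings": dict(zip(split.keys(), readings)),
                "did_you_mean": suggestion,  # a suggestion only, not a verdict
                "votes": votes,
            })
    return flags


# Illustrative data only: Wylie transliteration with spaces as the delimiter.
sample = {
    "edition_a": "sangs rgyas kyi bka'",
    "edition_b": "sangs rgyas kyi bka'",
    "edition_c": "sangs rgyas kyis bka'",
}
for flag in flag_variants(sample, delimiter=" "):
    print(flag)
```

A majority reading is not necessarily the correct one, so in this sketch every flag is only a candidate for a human proofreader to review.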
Key to the success of this project is the feedback loop wherein human proofreaders identify an error type, which is taught to the machine; the machine then searches for similar errors, and human proofreaders optimize future recognition by affirming the error (or denying it). This process leads to an ever-refined and self-correcting system that assists us in creating a version of the Tibetan Buddhist Canon that is as perfect as possible.
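The feedback loop can be sketched along the following lines. This is a simplified illustration under our own assumptions, not a description of the project's software: the `ErrorRule` class, the regular-expression representation of an error type, the thresholds, and the sample rule and text are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class ErrorRule:
    """A hypothetical error type taught to the machine by a proofreader."""
    pattern: str          # regular expression matching the suspected error
    suggestion: str       # proposed correction
    confirmed: int = 0    # times a human affirmed a flagged match
    rejected: int = 0     # times a human denied a flagged match

    @property
    def precision(self):
        """Share of reviewed matches that humans confirmed (0.5 if unreviewed)."""
        reviewed = self.confirmed + self.rejected
        return self.confirmed / reviewed if reviewed else 0.5


def find_candidates(rule, text):
    """Machine step: list (offset, matched text) for every suspected error."""
    return [(m.start(), m.group(0)) for m in re.finditer(rule.pattern, text)]


def record_verdict(rule, accepted):
    """Human step: affirm or deny one flagged match."""
    if accepted:
        rule.confirmed += 1
    else:
        rule.rejected += 1


def active_rules(rules, min_precision=0.3, min_reviews=5):
    """Keep rules that are still under review or that humans mostly confirm."""
    return [
        r for r in rules
        if r.confirmed + r.rejected < min_reviews or r.precision >= min_precision
    ]


# Illustrative data only: a made-up rule applied to a Wylie-transliterated string.
rule = ErrorRule(pattern=r"\brgas\b", suggestion="rgyas")
text = "sangs rgas dang sangs rgyas dang sangs rgas"
for offset, match in find_candidates(rule, text):
    print(offset, match, "->", rule.suggestion)
```

In this sketch, a rule that proofreaders keep rejecting eventually drops out of `active_rules`, while a rule they keep confirming stays in use; this is one simple way the human verdicts could refine what the machine flags next.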
Benefiting Teachers, Students, Translators, & Readers Worldwide
Furthermore, social media platforms help us share this process with scholars of the Tibetan Buddhist tradition far and wide, who can add their valuable contributions and expert knowledge; a collaborative platform allows these experts to be an intimate part of the proofreading process. Although the process will be continual, it is possible to “freeze” it at any point with the best version to date and create a printout.
Meanwhile, the digital data becomes a truly living text, an open-source repository for all Buddhists on the planet to access, use, and improve by making suggestions for further edits, benefiting scores of teachers, scholars, and translators in the process. And, of course, when these experts are given the tools to clearly, accurately, and easily convey the Buddha’s speech, these improvements will reach students and practitioners, too!