The Southeast Asian Languages Library

Southeast Asia is a global crossroads of singular geopolitical importance: an astonishing 30% of the world’s trade goods transit its central Malacca Straits. Unfortunately, we have little capacity for information access or language reference in this important and often tumultuous region. The mainland countries are represented least: data and dictionaries that use the Indic-derived scripts of Burma, Laos, Thailand, Cambodia and the minority Mon, Karen, and Shan states, and Vietnam’s Roman-derived, Chinese-influenced Quốc Ngữ script, are largely unavailable in the U.S., and there has been little software development beyond basic tools for text input and output.

The Southeast Asian Languages Library – SEAlang Library, for short – is a technically innovative plan to build core lexical resources for all Southeast Asian languages, starting with the difficult scripts used by the five mainland countries.

Broad support for the SEAlang Library reflects its importance to the Southeast Asian Studies community. In preparing this proposal, the University of Wisconsin-Madison Center for Southeast Asian Studies (CSEAS, host of the Southeast Asian Studies Summer Institute, SEASSI) and co-sponsor Center for Research in Computational Linguistics (CRCL Inc., a US 501(c)(3) nonprofit) have made concrete plans for cooperation with the Center for Khmer Studies (CKS, Siem Reap), the Ecole française d'Extrême-Orient (EFEO), the Committee on Research Materials on Southeast Asia (CORMOSEA), the Coalition of Teachers of Southeast Asian Languages (COTSEAL), and NGO-based ‘open source’ software projects in Burma, Laos, Thailand, Cambodia, and Vietnam.

The SEAlang Library will provide:

DICTIONARIES: we will prepare XML-metatagged digital bilingual dictionaries, based on the best available print reference works – often difficult to obtain from U.S. libraries – for the national languages Burmese, Lao, Thai, Khmer, and Vietnamese, and the major ethnic minority languages Mon, Karen, and Shan. These will be supplemented by historical dictionaries in cases of significant orthographic change, and extended, where possible, by lexicons of newly minted words. All SEAlang dictionaries will be accessible via approximate search software that accepts queries in national orthography, transliteration, or phonetic transcription, and can be used both interactively and as program-accessible Web resources.

TEXT CORPORA: we will build monolingual and aligned bitext corpora. Used to study collocation and usage, and to support data-driven language learning, these are necessary precursors to more advanced translation tools and to monolingual and cross-language information retrieval tools. We will provide substantial (up to tens of millions of words) monolingual corpora for each majority language, along with the largest feasible (hundreds of thousands of words for Thai and Vietnamese, and fewer for others) aligned two-language corpora, drawn from both on-line resources and on-the-ground publishing contacts.

SOFTWARE: we will build information access tools for Southeast Asian scripts, including tools for segmentation and transliteration, conversion between font encodings, text harvesting and indexing, and statistical analysis. User applications, including the SEA-Search query builder, the SEA-Cat Library of Congress Romanization / cataloging utility, the SEA-Read reader’s helper, and the SEA-See text-as-image utility for scripts (like Khmer) that are difficult to render in Unicode, will be linked to dictionaries, text corpora, and transliteration engines to help fulfill the promise of regional information access.

The Southeast Asian Languages Library is a long-awaited addition to the national digital infrastructure being built with the support of a variety of U.S. Department of Education Title VI programs. It will enable:

pedagogy and new teaching, learning, and translation tools for less-commonly taught languages,

scholarly inquiry in linguistics, history, lexicography/etymology, and Southeast Asia area studies,

scientific research in computational linguistics and cross-language information retrieval, and

language reference that is now all but unavailable to the 1.8 million Americans of mainland Southeast Asian heritage, who can typically speak – but not read, or consult reference materials in – their heritage languages.