SEAlang Library: Features for Teachers (preliminary guide, Spring 2006)

SEAlang Library: Features for Teachers

(Summer 2006)

Introduction

The SEAlang Library is a tool for teaching and research, as well as for student reference. The dictionary, corpus, and bitext resources are all capable of producing materials for classroom, testing, and study, as well as for helping the instructor gain fresh insight into the kinds of problems that students face.

Coverage The SEAlang Library will provide dictionaries, corpora, and (when available) bitext corpora for the national languages Thai, Burmese, Lao, Khmer, and Vietnamese, and dictionaries for Mon, Shan, and Karen.

Data contributions We are always eager to extend SEAlang Library coverage, so please get in touch if you have additional dictionaries or texts for any of these languages.

Other languages For 2006-2009, SEAlang is focusing on Southeast Asia’s non-roman scripts. We plan to begin adding roman-script SEA languages in 2009. However, if you have materials available in electronic form now, we are willing to add them.

Availability of features Library features will vary depending on underlying data resources. Our goal is to get existing texts on line as rapidly as possible, adding two languages per year, with one or two dictionaries per language. While we’re working with the best materials we could find, there are variations in quality. Still, each language makes the best use of available resources.

Design All of the SEAlang tools provide significant innovations in functionality, user-interface design, and in the display of potentially very large amounts of data. SEA language-specific features have been incorporated whenever possible. Please let us know if any additional features would be helpful.

Extension The SEAlang Library is designed to allow ongoing extension and updates as new materials become available, and as you – the SEAlang user community – become interested in improving existing resources.

Tools 1: Dictionary

Searching The SEAlang Library’s digital dictionaries are not simply electronic equivalents of traditional print texts. Rather, they allow many kinds of searches that printed books are not capable of providing. These include:

Phonemic approximation. Students frequently have trouble identifying contrasts in tone, vowel length, vowel value, nasalization, voicing, and aspiration. The approximation tools let students sidestep pitfalls in searching, and let the instructor extract minimum contrast sets for drill and practice.
Orthographic aproximation. Languages with extensive character sets (especially Khmer and Thai) are difficult for students. Loanwords that use irregular spellings are a particular challenge. The approximation tool lets the student find words he or she almost knows how to spell, and lets the instructor extract all instances of well-known pitfalls (e.g. Thai’s leading ทร- or final -จ).
Combined/restricted approximations. Phonemic and orthographic approximation can be combined to generate particular forms of irregular spelling. For example, a search for Thai .*ล with phonetic .*n yields all words in which final ‘l’ is realized as ‘n.’
Vernacular phonetics. In some cases SEA languages have local phonetic systems, used in vernacular monolingual dictionaries. We support such searches when applicable (in particular, for Thai). .
Reverse searches. English reverse searches (of the definition content) are supported.
Derivational expansion in reverse searching. Approximate matching can also be used to expand search targets to include derived forms; e.g. house expands to house, houses, housing, and housed.
Restriction by tags: POS, etymology, usage, subject. All internal dictionary tagging information can be used either to restrict a search (e.g. to verbs only), or to retrieve all items of a particular variety (e.g. all items tagged as ‘polite’). This is very helpful in preparing to discuss grammar, syntax, social register, and the like.
Compound segmentation. We segment (nearly) all compounds, and link each component to the proper head so that the student can track the compound back to its components. The exceptions are words derived from glossaries, rather than conventional dictionaries.
Data mining. The set of compounds included under a given headword usually represent only a fraction of actual headword appearances (especially in conservative dictionaries that only list compound led by the headword). We dynamically build each headword’s complete set of compounds, and separate the list into leading, trailing, and embedded items. The same exception for words from glossaries applies.

Approximation The exact details of phonemic and orthographic approximation vary from language to language, but the underlying principles are always the same:

Vowels. Vowels are grouped into sets that are likely to cause perceptual problems for Western students. For example, o|ɔ, e|ɛ, and u|ʉ|ə are common groupings, as are ua|ʉa and a|aa, i|ii, e|ee, and so on.
Consonants. Consonants follow a similar pattern: b|p|pʰ, ʃ|ʧ|ʤ|ʧʰ, m|m̥, and so on.
Tones. All tones are grouped together. At present, they are ignored by default in Thai. In Burmese, nasalization and glottalization are treated as tones.
Shortcuts. C replaces any consonant, V stands in for a vowel, D for dipthong, and T for tone.

Although we have a fair amount of insight into the best rule set for each language, we consider this to be a preliminary implementation and welcome user comments based on classroom experience.

Wild cards All Unix-like regular expressions (e.g. as found in perl, vi, or grep) are allowed. In particular:

^ match at beginning
$ match at end
. match any single character
.* match any sequence of characters

Match length Phonetic and orthographic matches can be restricted to being a) whole words or compounds, b) headwords or words found within compounds, or c) syllables or longer. In some cases, we use dictionary data that does not distinguish between headwords and compounds, so this rule cannot always be applied properly.

Q & A
Why don’t all of the entries have part-of-speech / phonetic / antonym & synonym / classifier / etc. data? We take the dictionary data as we find it. Our first goal has been to get useful data on line – expect cleanup and improvement over the next few years.

Why was there a question mark / square box in the phonetic? Occasionally an oddball character (usually in the phonetic) wasn’t cleaned up or converted to Unicode properly – we’re fixing these asap.

Why isn’t every compound word associated with the proper head? When a head has etymologically distinct orthonyms, compounds have to be segmented and associated with the right head individually. This takes time, but we’re working on it (e.g. we’ve just disambiguated more than 13,000 Burmese compounds). Note that an appropriate head entry doesn’t always exist in the original dictionary, either.

Tools 2: Corpus

Monolingual text corpora have attracted considerable interest in the past decade for several reasons.

First, they provide an objective view of the ways that words are actually used most frequently. This has had a dramatic effect on lexicography, where custom, rather than evidence, has often dictated the order and extent of dictionary definitions, often with the result that very common bleached or figurative meanings are ignored.
Second, they elicit evidence of the meanings that words appear less frequently, and which are not always adequately addressed in bilingual dictionaries. This is particularly important in Southeast Asian languages, which rely heavily on compound constructions.
Third, they reveal multi-word constructions that not always recognized as compounds, or seen as suitable for dictionary entry (e.g. English prepositional phrases, or in the SEA context, serial verb constructions).
Finally, they provide an effective index to native-speaker judgements of appropriate lexical choice when syntax and grammar do not come into play.

Native-speaker ability is not always helpful in anticipating the difficulties that students will encounter. For example, native speakers automatically filter out traditionally assigned literal meanings that conflict with common sense (do we really beg for a pardon in English, or ask for punishment in Thai?), but learners do not have this built-in radar. Corpus evidence encourages the teacher (and lexicographer) to account for such uses sensibly.

For teachers of less commonly taught languages, text corpora can play an important role in filling the gap left by a lack of suitable guides to grammar and syntax. The SEAlang Library corpus near feature, as well as the ability to restrict collocates to particular usage or parts of speech, are specifically designed to elicit larger-scale text phenomena. This includes split constructions, modals (which may be restricted to preceding or following a verb), classifiers, class terms, and so on.

A corpus is also an excellent source of drill and test material. Corpus results can be cut-and-pasted, cutting out difficult words, or otherwise modifying if necessary, to create a stock of raw materials for cloze tests, rearrange-the-word drills, translation tests, etc. The ready availability of such material is particularly helpful for real-world classroom environments, where it may be helpful to create multiple sets of roughly equivalent texts to serve as practice guides and makeup tests.

The SEAlang Library corpus tool provides these basic functions:

Collocate search. This finds the search target’s immediate neighbors (an underscore indicates that the neighbor was a space or newline). At present (Summer, 2006) the target-collocate pairs are sorted by frequency; in the near future subtler forms of collocate analysis, which help discount the frequent appearances of very commonplace words, will be implemented as well.
Context search. This finds collocates, then returns example contexts (default 5) for each.
Merged search. This combines the collocate and context searches into a limited brief view, or a more elaborate detailed view.
Raw search. Because SEA texts are normally not segmented into words, simple string matches would yield incorrect results (e.g. finding imp in simple). The Library performs peephole segmentation – it segments the target’s immediate neighborhood, then returns the item only if it’s a plausible word with plausible neighbors. Unfortunately, this excludes words if they (or their neighbors) aren’t found in the dictionary. Raw search overrides the peephole segmentation step, and returns the context.

For example, here is a search for collocates. The items highlighted in blue are tems that are already dictionary entries. This feature is particularly helpful for revealing items that should be compounds listed in the dictionary, but are not.

Here is a context search, showing both left- and right-hand matches (these can be retrieved separately as well; note the yellow ‘show leading …’ and ‘show trailing …’ tags above). The < or > means “the word came from this side”:

Two search targets can be provided:

A | B. Match terrm A or term B. This is useful to compare alternative spellings, or two distinct words that have similar semantics (e.g. English disaster vs. catastrophe).
A/B. Match AB or BA as a single word. This is helpful for finding some kinds of expressives or euphonic doubles.
A ~ B Match A if it is close to B (distance can be specified). This locates any kind of split construction.
A ~~ B Equivalent to A ~ B | B ~ A. This helps locate and explicate items that can appear in either pre- or post-position, but not always both, and not always with the same meaning.

The corpus itself does not require advance preparation of any kind (other than being in plain-text Unicode). Please contact us if you have a specialized corpus (e.g. transcribed speech) you would like to share.

Text corpora can be very large. While this is necessary for finding less common targets, subsampling is a more effective alternative for ordinary terms. In practice, most results are found using these three steps:

The complete corpus is scanned for any possible match.
Peephole segmentation, described above, is then applied to discard likely false matches (which occur if the target was embedded in a longer word, or fell across two words).
A subsample (default 1,000) of the likely matches is selected, sorted, and returned.

Thus, for common search targets the corpus tool will produce a different set of results each time. The gross distributions of word + collocate(s) will remain more or less the same, but the specific examples counted and returned will be different.

Finally, the ability to specify and/or restrict collocate types is still being developed. For example, symbolic entries like N (number), C (classifier) and so on are reasonable candidates for implementation, as is the ability to require that collocates have particular POS or usage tags. Please contact us (preferably with a prepared list of items) if this sort of specification would be helpful.

Q & A
Why are there oddball characters (like question marks) at the beginning and end of each corpus line? These are fractional parts of Unicode characters. We’ll be cleaning these up soon.

What does the underline mean? An underline: “_house” represents a space or newline.

The word you say is the left or right neighbor is obviously just part of a longer word. How come it was returned anyway? Because perfect segmentation by computer is hard!

Tools 3: Bitext Corpus

Bitext Corpus Aligned bitexts are a traditional tool of European language instruction, where bilingual literary texts have been widely available for many decades. They are rarely used in Southeast Asia, however, even in ESL programs. Bitexts are extremely helpful in three applications.

First, they encourage extensive reading. This is important for SEA texts, whose complex scripts present an unusually high entry barrier. Work volume – the number of words read – is critical in developing reading skill. While close, intensive reading certainly has its place, it is not always conducive to reading for pleasure. Bitexts, even if just skimmed in advance, give the student enough of the gist of the story to help him or her avoid getting stuck on loanwords, doubles, names, and other roadblocks to reading nonsegmented texts.
Secondly, they provide excellent source material for focused data-driven learning. In this application, the student follows ordinary dictionary reference with a sample of the new word in actual translated contexts. Bitext lookup can dramatically extend the few phrasal translations typically found in a dictionary.
Finally, bitexts give a broader, more authentic view of the expressive choices available to the translator or second-language speaker. In this application, the student’s native language can serve as the search key. For example, a search for the word home can be set to exclude all contexts that include a common translation equivalent (for Thai, this might be บ้าน). Few Southeast Asian languages have dictionaries or thesauruses that are up to such a task.

The SEAlang Library bitext tools provide these basic functions. Two search targets can be provided:

A | B. Match terrm A or term B. This is useful to compare alternative spellings, or two distinct words that have similar semantics (e.g. English disaster vs. catastrophe).
A/B. Match AB or BA as a single word. This is helpful for finding some kinds of expressives or euphonic doubles.
A ~ B Match A if it is close to B (distance can be specified). This locates any kind of split construction.
A ~~ B Equivalent to A ~ B | B ~ A. This helps locate and explicate items that can appear in either pre- or post-position, but not always both, and not always with the same meaning.

Searches can be in either, or both, a Southeast Asian L1 or (usually) Western L2. Four match possibilities are supported:

Southeast Asian or Western.
Southeast Asian and Western.
Southeast Asian and not Western.
Western and not Southeast Asian.

These alternatives can be extremely helpful for finding atypical translations or expressions.

As in the dictionary reverse search, the bitext corpus tools supports derivational expansion of English search terms, so that house expands to house, houses, housing, and housed.

At present (Summer, 2006) only Thai-English bitexts are supported (but only because closely translated material is so difficult to find). Most of our source material was originally English, given a close translation into Thai by design. In our experience, translation from a Southeast Asian language tends to be problematic, and does not yield the orderly sentence-by-sentence alignment usually sought from bitext corpora in this application.

Q & A
Why isn’t the search word’s translation highlighted as well? Because the bitexts are only aligned sentence by sentence. You’re welcome to pursue spotting the translation as a research project!

Other Research Applications

The SEAlang Library is meant to support research in SEA linguistics and language education. Please see the Programmers Guide for additional information on system features and implementation.