About the SEAlang Library Thai Text Corpus
This monolingual corpus consists of Thai texts published on the Internet, sampled here for research and educational purposes.
- Context searches show how the search target appears in context, taking both leading and trailing collocates (neighboring words) into account.
- Collocate searches are better for focusing on the search target's immediate neighbors. This search returns separate lists of leading and trailing collocates.
- Merged view allows for fast switching between collocate and context views. The Go! button invokes the brief view (recommended).
- Raw contexts show the search word in context without any attempt at analysis or sanity checking (the local segmentation that helps ensure that a real word has been found).
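As a rough illustration of what a collocate search tallies, the sketch below counts the words immediately before and after each hit in text that is already split into words. The function name `collocates` and the toy English tokens are assumptions made for this example; the actual SEAlang implementation is not described on this page and works on Thai text.

```python
from collections import Counter

def collocates(tokens, target):
    """Count the words immediately before and after each hit of target."""
    leading, trailing = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:                      # a hit at position 0 has no leading collocate
            leading[tokens[i - 1]] += 1
        if i + 1 < len(tokens):        # a hit at the end has no trailing collocate
            trailing[tokens[i + 1]] += 1
    return leading, trailing

# Toy, pre-segmented input (an assumption; the corpus itself is not pre-segmented).
tokens = "the red house and the old house on the hill".split()
leading, trailing = collocates(tokens, "house")
print("leading: ", dict(leading))
print("trailing:", dict(trailing))
```

Keeping the two counters separate mirrors the separate leading and trailing lists that a collocate search returns.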
Restrict collocates requires (or forbids) every collocate to have at least one sense with a particular part of speech or usage.
Wild things are specialized wildcard characters that represent lists of words. The predefined wild things are: N (numbers: Thai/roman digits, words for numbers, and partitives [some, every, etc.]), D (demonstratives: this, that, etc., as well as a, the first / last, etc.), and Q (questions: interrogatives like why, how, or not, etc.).
The symbols < and > specify required left- or right-hand neighbors. For example, a < b > c only matches b if it is preceded by a and followed by c. This is very helpful in conjunction with wild things, above.
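The neighbor constraints can be pictured with a small sketch. It assumes pre-segmented English tokens and tiny made-up word lists for N, D, and Q; the names `WILD`, `word_matches`, and `find_hits` are hypothetical and do not come from the SEAlang software, whose real lists are far larger and cover Thai.

```python
# Hypothetical stand-ins for the predefined wild things (illustration only).
WILD = {
    "N": {"one", "two", "three", "some", "every"},
    "D": {"this", "that", "the"},
    "Q": {"why", "how"},
}

def word_matches(pattern, word):
    """A pattern is either a wild-thing name (N, D, Q) or a literal word."""
    if pattern in WILD:
        return word in WILD[pattern]
    return pattern == word

def find_hits(tokens, target, left=None, right=None):
    """Return positions of target whose neighbors satisfy left < target > right."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if left is not None and (i == 0 or not word_matches(left, tokens[i - 1])):
            continue
        if right is not None and (
            i + 1 >= len(tokens) or not word_matches(right, tokens[i + 1])
        ):
            continue
        hits.append(i)
    return hits

tokens = "two dogs ran and that dogs sat but dogs ran".split()
print(find_hits(tokens, "dogs", left="N"))               # N < dogs
print(find_hits(tokens, "dogs", right="ran"))            # dogs > ran
print(find_hits(tokens, "dogs", left="D", right="sat"))  # D < dogs > sat
```

In this sketch a constraint that is left as `None` is simply not checked, so the same function covers left-only, right-only, and two-sided patterns.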
Southeast Asian writing is normally broken into phrases, rather than individual words. Rather than trying to segment our corpus texts in advance, we use peephole segmentation: we only try to segment the search target's immediate neighbors, on the fly. This will sometimes produce incorrect results. However, it is far more robust, and returns much more potentially useful data, than pre-segmentation would.
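One way to picture peephole segmentation is the sketch below, which finds the target in unsegmented text and then looks only at the characters on either side of each hit. It assumes a five-word lexicon and a simple longest-match rule; both are illustrative choices, and the names `LEXICON`, `left_neighbor`, `right_neighbor`, and `peephole` are hypothetical rather than part of the SEAlang software.

```python
# Hypothetical mini-lexicon; a real system would need a full Thai dictionary.
LEXICON = {"ฉัน", "กิน", "ข้าว", "ที่", "บ้าน"}
BY_LENGTH = sorted(LEXICON, key=len, reverse=True)  # try longer words first

def left_neighbor(text, start):
    """Longest lexicon word that ends exactly where the hit starts, or None."""
    for word in BY_LENGTH:
        if text.endswith(word, 0, start):
            return word
    return None

def right_neighbor(text, end):
    """Longest lexicon word that starts exactly where the hit ends, or None."""
    for word in BY_LENGTH:
        if text.startswith(word, end):
            return word
    return None

def peephole(text, target):
    """Return (left, target, right) for each occurrence of target in text."""
    results = []
    pos = text.find(target)
    while pos != -1:
        end = pos + len(target)
        results.append((left_neighbor(text, pos), target, right_neighbor(text, end)))
        pos = text.find(target, pos + 1)
    return results

text = "ฉันกินข้าวที่บ้าน"   # unsegmented: "I eat rice at home"
print(peephole(text, "ข้าว"))
```

Only the material adjacent to each hit is ever examined, so the rest of the text never has to be segmented. A neighbor of `None` means no lexicon word fits at that edge, which is the kind of case where a sanity check could flag the hit as doubtful.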
Because the underlying text corpus may be quite large (more than 50 million characters in this implementation), results may be taken from a random sample of hits. For common words, this means that sample contexts and exact collocate frequencies will vary from run to run.
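The run-to-run variation can be illustrated with a short sketch that counts trailing collocates over a random sample of hits. The function name, the sample size of 20, and the toy data are assumptions made for this example; the page does not say how the actual sampling is done.

```python
import random
from collections import Counter

def sampled_trailing_collocates(tokens, target, sample_size, rng):
    """Count trailing collocates over a random sample of hit positions."""
    # Skip the final token so every hit has a following word.
    hits = [i for i, tok in enumerate(tokens[:-1]) if tok == target]
    if len(hits) > sample_size:
        hits = rng.sample(hits, sample_size)
    return Counter(tokens[i + 1] for i in hits)

# Toy data: "go" is followed by "home" 60 times and by "away" 40 times.
tokens = ("go home " * 60 + "go away " * 40).split()

run_a = sampled_trailing_collocates(tokens, "go", 20, random.Random(1))
run_b = sampled_trailing_collocates(tokens, "go", 20, random.Random(2))
full = sampled_trailing_collocates(tokens, "go", 1000, random.Random(0))
print("run A:", dict(run_a))
print("run B:", dict(run_b))
print("all hits:", dict(full))
```

Each sampled run counts exactly 20 hits, but the split between the two collocates depends on which hits were drawn, whereas a target with fewer hits than the sample size is counted in full and gives the same result every time.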
Clicking on a word/collocate with the mouse starts a new search: yellow searches for contexts, and black searches for collocates.
See the Features for Teachers and Features for Students on the SEAlang Library page.
Look for continuing development of SEAlang Library Thai resources.