Corpus building for minority languages:
Applications


Lexicography

The Welsh corpus is being used by the University of Wales Dictionary of the Welsh Language (Geiriadur Prifysgol Cymru). There is a mention of the corpus work here. Other lexicographical projects I've helped:

Grammar Checking

The 100 million+ words of Irish downloaded by the crawler have been instrumental in the development of my grammar checker An Gramadóir and of many other NLP applications for Irish. Other language groups have used the Crúbadán data to develop grammar checkers, including Afrikaans (Petri Jooste), Breton (Thierry Vignaud), Cornish (Edi Werner), Occitan (Bruno Gallart), and Walloon (Pablo Saratxaga).

Diacritic Restoration

Using Unicode it is possible to create electronic documents in most languages with all of the proper diacritical marks and extended characters. Nevertheless, for various reasons speakers of many languages do not do this when writing emails or blogs or producing other documents for consumption on the web. In Lingala, for example, there are tone markings as well as two extended vowels, the "open e" (ɛ), and the "open o" (ɔ). On the web, tone marks are generally omitted and these vowels are written as "e" and "o" respectively, so "abɔkɔ́lɛ́kɛ́" becomes "abokoleke". This limits the usefulness of web texts for statistical purposes. To improve this situation, I wrote a script called "charlifter" that performs statistical diacritic restoration on web texts. This greatly enhances the usefulness of the Crúbadán corpora. The charlifter is also an application of the web crawler, in that the statistical language models it uses are created from the (rare) texts found by the crawler that use diacritical marks and extended characters correctly.
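The idea of statistical diacritic restoration can be illustrated with a minimal Python sketch. It is not the charlifter code: it assumes the simplest possible model, a lookup from each stripped word to the marked form seen most often in correctly written training text. The function names and the folding table for the Lingala vowels are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: on the web the extended vowels are folded to plain "e" and "o".
EXTENDED = str.maketrans({"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"})

def strip_marks(word):
    """Remove combining marks (tone, accents) and fold extended vowels."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.translate(EXTENDED)

def train(correct_words):
    """Count, for each stripped form, the marked forms seen in training text."""
    model = defaultdict(Counter)
    for word in correct_words:
        word = unicodedata.normalize("NFC", word)
        model[strip_marks(word)][word] += 1
    return model

def restore(text, model):
    """Replace each token by its most frequent marked form, if one is known."""
    restored = []
    for token in text.split():
        candidates = model.get(token)
        restored.append(candidates.most_common(1)[0][0] if candidates else token)
    return " ".join(restored)
```

With a model trained on text containing "abɔkɔ́lɛ́kɛ́", calling restore("abokoleke", model) gives back the marked form, while unknown tokens pass through unchanged. A word-by-word lookup like this cannot choose between two marked forms that share a stripped form; doing that would need context, which this sketch leaves out.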

Language Recognition

The n-gram statistics gathered from the corpora for each language provide the basis for a powerful and effective language recognition algorithm. Of course, particular care must be taken with language pairs that have very similar n-gram profiles; see the Language Similarity Table for more on this.
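As a rough illustration of n-gram language recognition, the Python sketch below compares the character 3-gram counts of a text with one profile per language and picks the closest. The choice of 3-grams, the cosine similarity measure, and the function names are assumptions made for the example, not a description of the recognizer used by the crawler.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Count character 3-grams after lowercasing and collapsing whitespace."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two 3-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the best-matching language code and the score for every profile."""
    query = char_trigrams(text)
    scores = {lang: cosine(query, profile) for lang, profile in profiles.items()}
    return max(scores, key=scores.get), scores
```

Here profiles would map a language code to char_trigrams() of that language's corpus. For closely related languages the top two scores can be nearly equal, so the gap between them is worth inspecting before trusting the answer.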

Hearing Testing

I've provided Mongolian corpus data to Drs. Richard Harris and Shawn Nissen at Brigham Young University for their project Development of digitally recorded Mongolian Speech Audiometry Materials, the aim of which is to produce low-cost hearing tests for Mongolian speakers. I provided additional material for work-in-progress on the Samoan language.

Dasher

Dasher is a free software package developed at the University of Cambridge that allows efficient text entry without a keyboard. It uses a language model trained on text corpora to help it make predictions; the Dasher developers are training 129 new language models using the An Crúbadán corpora.

Machine Translation

This is one of my primary interests. I've supplied web-crawled monolingual and parallel corpus data to several groups working on machine translation involving at least one minority or under-resourced language:

Other Projects

Assorted other projects in computational linguistics:

Spell Checking

New word lists

I've written a series of statistical filters that can be applied to the web corpora to generate clean word lists suitable for spell checking. These techniques have been applied to create the following spell checking packages:

Abandoned projects or works in progress

Please contact me if you speak one of these languages and would be willing to help.

Abandoned projects, word lists now available elsewhere

Improved word lists

The Swahili corpus was used to enhance the word list originally created by Jason Githeko of Egerton University, Njoro, Kenya. Read the Press Release.

Data for Northern Saami were provided to the Divvun project, who are developing open source morphological analyzers, spell checkers, and hyphenators.

I am hoping to use the Quechua corpora to help the Runasimipi project, which is dedicated to the development of language technology and localized software for speakers of Quechua. We will need some help to train the language recognizer to distinguish the many dialects found on the web; please contact me if you're interested in helping.

I provided frequency lists that underlie the open source spell checkers for Low Saxon (Heiko Evermann) and Luxembourgeois (Michel Weimerskirch).

Jacob Sparre Andersen has a powerful email-based editing system in place for Faroese. We've recently succeeded in arranging things so that our programs work together nicely: candidate words are extracted by An Crúbadán from the Faroese corpus using the techniques described below; these are sent automatically to Jacob's server, which prioritizes them and sends them via email to volunteer editors. Each night An Crúbadán can then download the modified word lists (of verified-correct and verified-incorrect words), which are in turn used to improve the crawler's language model, allowing more documents to be harvested, new words to be suggested, and so on.

If you'd like to set up a similar system for your language using the An Crúbadán corpora, first download and install Jacob's "speling.org" system, available from his site, and then contact me.

How it works

First, statistics measuring co-occurrence with the highest-frequency words in the target language are used to filter out sections written in other languages or containing mostly noise (e.g. computer code, tabular data, etc.). The remaining text is tokenized and used to generate a word list sorted by frequency, and the lowest-frequency words are filtered out. Then, depending on the target language, correctly-spelled words from one or more "polluting" languages are filtered out to be checked by hand later. Usually this means English, but I also filter Dutch from the Frisian corpus, Spanish from Chamorro, etc. The remaining words are used to generate 3-gram statistics for the target language. These are used to flag as "suspect" any remaining words containing one or more improbable 3-grams. Additional filters specific to certain languages can be applied optionally; for instance, pairs of words differing only in the presence or absence of diacritical marks can be flagged, as can words with a capital letter appearing after the first letter, words with no vowels, etc.
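The main steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual filters: the share of high-frequency words in a section stands in for the co-occurrence statistics, a 3-gram counts as "improbable" when it occurs in fewer than min_trigram_count distinct-word positions, and all names and thresholds are hypothetical.

```python
from collections import Counter

def looks_like_target(section, top_words, min_ratio=0.2):
    """Keep a section only if enough of its tokens are high-frequency words."""
    tokens = section.lower().split()
    if not tokens:
        return False
    hits = sum(token in top_words for token in tokens)
    return hits / len(tokens) >= min_ratio

def word_trigrams(word):
    """Character 3-grams of a word, with start and end markers."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_word_list(tokens, polluting, min_count=2, min_trigram_count=2):
    """Return (accepted, suspect, set_aside) word lists from corpus tokens."""
    freq = Counter(t.lower() for t in tokens if t.isalpha())
    # Drop the lowest-frequency words.
    kept = {w: c for w, c in freq.items() if c >= min_count}
    # Set aside words that are correctly spelled in a polluting language.
    set_aside = sorted(w for w in kept if w in polluting)
    remaining = {w: c for w, c in kept.items() if w not in polluting}
    # Assumption: 3-gram statistics are counted once per distinct word.
    tri = Counter(g for w in remaining for g in word_trigrams(w))
    suspect = {w for w in remaining
               if any(tri[g] < min_trigram_count for g in word_trigrams(w))}
    # Accepted words, most frequent first, ties broken alphabetically.
    accepted = sorted((w for w in remaining if w not in suspect),
                      key=lambda w: (-remaining[w], w))
    return accepted, sorted(suspect), set_aside
```

On a real corpus the thresholds would need tuning per language: a count-based cutoff flags too many words when the vocabulary is small, because rare but legitimate 3-grams have not yet been seen often enough.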

Please contact me (kscanne at gmail dot com) if you are interested in applying these techniques to a new language.


© Cóipcheart/Copyright 2004 Kevin P. Scannell