Corpus building for minority languages:
Applications
Lexicography
The Welsh corpus is being used by the University of Wales Dictionary of the Welsh Language (Geiriadur Prifysgol Cymru). There is a mention of the corpus work here.
Grammar Checking
The 100 million+ words of Irish downloaded by the crawler have been instrumental in the development of my grammar checker An Gramadóir, and many other NLP applications for Irish.
Language Recognition
The n-gram statistics gathered from the corpora for each language provide the basis for a powerful and effective language recognition algorithm. Of course, particular care must be taken with language pairs that have very similar n-gram profiles; see the Language Similarity Table for more on this.
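As a rough illustration of how n-gram statistics can drive language recognition, the sketch below builds a character 3-gram profile for each language and picks the profile closest to the input text. Everything in it is an assumption made for the example: the function names, the choice of cosine similarity as the comparison, and the two tiny training sentences. It is not the code used by An Crúbadán.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count character n-grams, with a space marking each word boundary."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(text, profiles):
    """Return the best-matching language code and all similarity scores."""
    sample = char_ngrams(text)
    scores = {lang: cosine(sample, prof) for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores

# Toy training data (hypothetical); real profiles would come from full corpora.
training = {
    "ga": "tá an aimsir go maith inniu agus beidh sé go breá amárach",
    "cy": "mae'r tywydd yn braf heddiw a bydd hi'n braf yfory hefyd",
}
profiles = {lang: char_ngrams(text) for lang, text in training.items()}

best, scores = identify("beidh an aimsir go breá", profiles)
print(best, scores)
```

For closely related languages the top two scores will be near each other, so a practical recognizer would probably want to check the margin between them before committing to an answer.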
Dasher
Dasher is a free software package developed at the University of Cambridge that allows efficient text-entry without a keyboard. It uses a language model trained on text corpora to help it make predictions; the Dasher developers are training 129 new language models using the An Crúbadán corpora.
Spell Checking
New word lists
I've written a series of statistical filters that can be applied to the web corpora to generate clean word lists suitable for spell checking. These techniques have been applied to create the following spell checking packages:
- aspell-az, hunspell-az (Azerbaijani). Joint work with Metin Amiroff.
- aspell-csb, hunspell-csb (Kashubian). Joint work with Roman Drzeżdżon and Piotr Formella.
- aspell-fy, ispell-fy, hunspell-fy (Frisian). Joint work with Eeltje de Vries.
- aspell-ga, ispell-ga, hunspell-ga (Irish).
- aspell-gd, ispell-gd, hunspell-gd (Scottish Gaelic).
- aspell-gv, ispell-gv (Manx Gaelic). Using earlier work of Alastair McKinstry.
- aspell-hil, ispell-hil (Hiligaynon). Joint work with Francis Dimzon.
- aspell-ku, ispell-ku, hunspell-ku (Kurdish). Joint work with Erdal Ronahi and Rêzan Tovjîn.
- aspell-mg, ispell-mg, hunspell-mg (Malagasy). Joint work with Rado Ramarotafika.
- aspell-mn (Mongolian). Joint work with Sanlig Badral. See the announcement (in Mongolian). I'm "Профессор Доктор Кэвин Сканнелл".
- aspell-ny, ispell-ny, hunspell-ny (Chichewa). Joint work with Soyapi Mumba.
- aspell-rw, ispell-rw, hunspell-rw (Kinyarwanda). Joint work with Steve Murphy and Philibert Ndandali.
- aspell-tet, ispell-tet, hunspell-tet (Tetum). Joint work with Peter Gossner.
- aspell-tk, hunspell-tk (Turkmen). Joint work with Jumamurat Bayjan.
- aspell-tl, ispell-tl, hunspell-tl (Tagalog). Joint work with Ramil Sagum.
- aspell-tn, ispell-tn, hunspell-tn (Setswana). Joint work with Thapelo Otlogetswe.
Improved word lists
The Swahili corpus was used to enhance the word list originally created by Jason Githeko of Egerton University, Njoro, Kenya. Read the Press Release.
Data for Northern Saami were provided to the Divvun project, who are developing open source morphological analyzers, spell checkers, and hyphenators.
I am hoping to use the Quechua corpora to help the Ciber-runa project, which is dedicated to the development of language technology and localized software for speakers of Quechua. We will need some help to train the language recognizer to distinguish the many dialects found on the web; please contact me if you're interested in helping.
Jacob Sparre Andersen has a powerful email-based editing system in place for Faroese. We've recently succeeded in arranging things so that our programs work together nicely: candidate words are extracted by An Crúbadán from the Faroese corpus using the techniques described below; these are sent automatically to Jacob's server, which prioritizes them and sends them via email to volunteer editors. Each night An Crúbadán can then download the modified word lists (of verified-correct and verified-incorrect words), which are in turn used to improve the crawler's language model, allowing more documents to be harvested, new words to be suggested, and so on.
If you'd like to set up a similar system for your language using the An Crúbadán corpora, first download and install Jacob's "speling.org" system, available from his site, and then contact me.
How it works
First, statistics measuring co-occurrence with the highest frequency words in the target language are used to filter out sections written in other languages or containing mostly noise (e.g. computer code, tabular data, etc.). The remaining text is tokenized and used to generate a word list sorted by frequency, and the lowest frequency words are filtered out. Then, depending on the target language, correctly-spelled words from one or more "polluting" languages are set aside to be checked by hand later. Usually this means English, but I also filter Dutch from the Frisian corpus, Spanish from Chamorro, etc. The remaining words are used to generate 3-gram statistics for the target language. These are used to flag as "suspect" any remaining words containing one or more improbable 3-grams. Additional filters specific to certain languages can be applied optionally; for instance, pairs of words differing only in the presence or absence of diacritical marks can be flagged, as can words with a capital letter appearing after the first letter, words with no vowels, etc.
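The later steps of this process can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual filters: the function names, the `min_count` and `min_trigram_count` thresholds, and the rule that a 3-gram is "improbable" when it occurs in fewer than a fixed number of distinct words are all hypothetical choices made for the example. The initial step of discarding foreign-language or noisy sections is omitted.

```python
from collections import Counter
import re
import unicodedata

def tokenize(text):
    """Split text into runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text)

def word_trigrams(word):
    """Character 3-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_candidates(text, polluting_words, min_count=2):
    """Drop rare words, then set aside words found in a 'polluting' language."""
    counts = Counter(tokenize(text))
    frequent = {w: c for w, c in counts.items() if c >= min_count}
    set_aside = {w for w in frequent if w.lower() in polluting_words}
    kept = {w: c for w, c in frequent.items() if w not in set_aside}
    return kept, set_aside

def flag_suspects(words, min_trigram_count=2):
    """Flag words containing a 3-gram seen in too few distinct words."""
    trigram_counts = Counter()
    for word in words:
        trigram_counts.update(set(word_trigrams(word)))
    return {w for w in words
            if any(trigram_counts[t] < min_trigram_count
                   for t in word_trigrams(w))}

def internal_capital(word):
    """Optional filter: a capital letter appearing after the first letter."""
    return any(ch.isupper() for ch in word[1:])

def strip_marks(word):
    """Remove diacritical marks so that near-duplicate pairs can be compared."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Toy input (hypothetical): "zzz" is too rare, "the" is a polluting word,
# and "catq" contains 3-grams that no other word shares.
text = "cat cats bat bats cat cats bat bats catq catq the the zzz"
kept, set_aside = build_candidates(text, polluting_words={"the"})
suspects = flag_suspects(kept)
print(sorted(kept), sorted(set_aside), sorted(suspects))
```

On a real corpus the thresholds would need tuning per language, and the suspect and set-aside lists are candidates for checking by hand rather than automatic rejections.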
Please contact me (kscanne at gmail dot com) if you are interested in applying these techniques to a new language.
© Cóipcheart/Copyright 2004 Kevin P. Scannell
