Corpus building for minority languages:
Applications
Grammar Checking
The 100 million+ words of Irish downloaded by the crawler have been instrumental in the development of my grammar checker An Gramadóir, and many other NLP applications for Irish. Other language groups have used the Crúbadán data to develop grammar checkers, including for Afrikaans (Petri Jooste), Breton (Thierry Vignaud), Cornish (Edi Werner), Occitan (Bruno Gallart), and Walloon (Pablo Saratxaga).
Diacritic Restoration
Using Unicode it is possible to create electronic documents in most languages with all of the proper diacritical marks and extended characters. Nevertheless, for various reasons speakers of many languages do not do this when writing emails or blogs or producing other documents for consumption on the web. In Lingala, for example, there are tone markings as well as two extended vowels, the "open e" (ɛ), and the "open o" (ɔ). On the web, tone marks are generally omitted and these vowels are written as "e" and "o" respectively, so "abɔkɔ́lɛ́kɛ́" becomes "abokoleke". This can cause ambiguities for people reading Lingala texts, and also limits the usefulness of web texts for statistical purposes. To improve this situation, I wrote a script called "charlifter" that performs statistical diacritic restoration on web texts. This greatly enhances the usefulness of the Crúbadán corpora. The charlifter is also an application of the web crawler, in that the statistical language models it uses are created from the (rare) texts found by the crawler that use diacritical marks and extended characters correctly. My student Michael Schade has written a Firefox add-on called accentuate.us that allows one to use this technology anywhere on the web: email, blogs, chats, etc.
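As a rough illustration of the idea (not charlifter's actual algorithm, which is not described here), the simplest statistical approach is a word-level lookup: learn from correctly marked text which full form most often corresponds to each stripped form, then substitute it. The function names and the folding table below are assumptions made for this sketch.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumed folding table for the Lingala example in the text;
# other languages would need their own mapping.
FOLD = str.maketrans({"ɛ": "e", "ɔ": "o"})

def strip_marks(word):
    """Reduce a word to the form typically seen on the web:
    drop combining marks (tones, accents) and fold extended vowels."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.translate(FOLD)

def train(correct_tokens):
    """Count, for each stripped form, the fully marked forms seen in
    text that is known to use diacritics correctly."""
    table = defaultdict(Counter)
    for token in correct_tokens:
        table[strip_marks(token)][unicodedata.normalize("NFC", token)] += 1
    return table

def restore(text, table):
    """Replace each whitespace-separated token with its most frequent
    marked form; unknown tokens are left unchanged."""
    restored = []
    for token in text.split():
        candidates = table.get(strip_marks(token))
        restored.append(candidates.most_common(1)[0][0] if candidates else token)
    return " ".join(restored)
```

A lookup like this cannot choose between two marked forms that share the same stripped form; doing so needs context (for example, neighbouring words or character n-grams), which this sketch deliberately leaves out.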
Language Recognition
The n-gram statistics gathered from the corpora for each language provide the basis for a simple and effective language recognition algorithm. Of course, particular care must be given to language pairs with very similar n-gram profiles; see the Language Similarity Table for more on this. The full corpus of Crúbadán 3-grams is now available (under the GPLv3) as part of the Natural Language Toolkit (NLTK); see their corpus download page.
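A minimal sketch of n-gram language recognition follows, assuming character 3-gram counts compared by cosine similarity. The scoring method and the function names are choices made for this illustration; the crawler's own recognizer may work differently.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count character 3-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two 3-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language code whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))

if __name__ == "__main__":
    # Toy profiles built from single sentences; real profiles would be
    # built from a full corpus for each language.
    profiles = {
        "ga": trigram_profile("tá an aimsir go maith inniu agus beidh sé go breá amárach"),
        "en": trigram_profile("the weather is good today and it will be fine tomorrow"),
    }
    print(identify("tá sé go maith", profiles))
```

For closely related languages the top two scores will be nearly equal, which is one way to see why such pairs need extra care.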
Hearing Testing
I've provided Mongolian corpus data to Drs. Richard Harris and Shawn Nissen at Brigham Young University for their project Development of digitally recorded Mongolian Speech Audiometry Materials, the aim of which is to produce low-cost hearing tests for Mongolian speakers. I provided additional material for work-in-progress on Samoan, Tagalog, and several South African languages. See the Master's Theses based on this work here and here.
Dasher
Dasher is a free software package developed at the University of Cambridge that allows efficient text-entry without a keyboard. It uses a language model trained on text corpora to help it make predictions; the Dasher developers are training 129 new language models using the An Crúbadán corpora.
Lexicography
The Welsh corpus is being used by the University of Wales Dictionary of the Welsh Language (Geiriadur Prifysgol Cymru). There is a mention of the corpus work here. Other lexicographical projects I've helped:
- focloir.ie, the new English-Irish dictionary project
- L. S. Gogan's dictionary
- kasahorow.org
- Is Iomaí Duine ag Dia, Dennis King et al.
- Foclóir Gaeilge-Fraincis, by Loig Cheveau
- SketchEngine
Spell Checking
New word lists
I've written a series of statistical filters that can be applied to the web corpora to generate word lists that speed the process of developing a new spell checker. These techniques have been applied to create the following spell checking packages:
- hunspell-as (Assamese). Joint work with Amitakhya Phukan.
- aspell-az, hunspell-az (Azerbaijani). Joint work with Metin Amiroff.
- Mozilla (Bosnian). With Mirsad Čirkić, based on earlier work of Ninoslav Jurković, Samir Ribić, and Vedran Ljubović.
- aspell-csb, hunspell-csb (Kashubian). Joint work with Roman Drzeżdżon and Piotr Formella.
- Mozilla (Diola-Kasa). Joint work with Outi Sane.
- Mozilla (Diola-Fogny). Joint work with Outi Sane.
- aspell-fy, hunspell-fy (Frisian). Joint work with Eeltje de Vries.
- aspell-ga, ispell-ga, hunspell-ga, Mozilla (Irish).
- Mozilla, hunspell-gd (Scottish Gaelic). Joint work with Michael Bauer.
- aspell-gv (Manx Gaelic). Using earlier work of Alastair McKinstry.
- Mozilla (Hawaiian).
- aspell-hil, Mozilla (Hiligaynon). Joint work with Francis Dimzon.
- hunspell-ht, Mozilla (Haitian Creole). Joint work with Jean Came Poulard and LogiPam.
- aspell-ku, ispell-ku, hunspell-ku, Mozilla (Kurdish). Joint work with Erdal Ronahi and Rêzan Tovjîn.
- aspell-ky (Kirghiz). Joint work with Ilyas Bakirov.
- hunspell-ln, Mozilla (Lingala). Joint work with Denis Jacquerye.
- aspell-mg, hunspell-mg (Malagasy). Joint work with Rado Ramarotafika.
- aspell-mn (Mongolian). Joint work with Sanlig Badral. See the announcement (in Mongolian). I'm "Профессор Доктор Кэвин Сканнелл".
- aspell-ny, hunspell-ny (Chichewa). Joint work with Soyapi Mumba and Edmond Kachale.
- hunspell-om (Oromo). Joint work with Belayneh Melka and Dawit Boka.
- aspell-rw, hunspell-rw (Kinyarwanda). Joint work with Steve Murphy and Philibert Ndandali.
- Mozilla (Songhay). Joint work with Abdoul Cisse and Mohomodou Houssouba.
- hunspell-so (Somali). Joint work with Mohamed I. Mursal. Packaged as a Mozilla add-on. See the announcement (English).
- aspell-tet, hunspell-tet (Tetum). Joint work with Peter Gossner.
- aspell-tk, Mozilla (Turkmen). Joint work with Jumamurat Bayjan.
- aspell-tl, hunspell-tl (Tagalog). Joint work with Ramil Sagum.
- aspell-tn, hunspell-tn (Setswana). Joint work with Thapelo Otlogetswe.
- hunspell-tpi, Mozilla (Tok Pisin). Joint work with Helge Søgaard.
- aspell-xh, hunspell-xh (Xhosa). Crúbadán word list is the basis for the translate.org.za spell checker.
- hunspell-zu (Zulu, experimental). Crúbadán word list is the basis for the translate.org.za spell checker, which includes a rich affix file by Friedel Wolff.
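The statistical filters mentioned at the start of this section are not specified on this page. As one plausible example of such a filter, the sketch below keeps only tokens that occur often enough, appear in more than one document, and use only letters from the language's alphabet. The thresholds, the function name, and the criteria themselves are assumptions made for illustration.

```python
from collections import Counter

def candidate_words(documents, alphabet, min_count=3, min_docs=2):
    """Return a sorted list of tokens likely to be real words.

    documents: iterable of plain-text strings from the corpus
    alphabet:  set of characters allowed in a word of the language
    min_count: minimum total occurrences across the corpus
    min_docs:  minimum number of distinct documents containing the token
    """
    total = Counter()      # occurrences over the whole corpus
    doc_freq = Counter()   # number of documents containing each token
    for doc in documents:
        tokens = [t.strip(".,;:!?\"'()").lower() for t in doc.split()]
        tokens = [t for t in tokens if t]
        total.update(tokens)
        doc_freq.update(set(tokens))
    return sorted(
        word for word in total
        if total[word] >= min_count
        and doc_freq[word] >= min_docs
        and set(word) <= alphabet
    )
```

Requiring a word to appear in several documents guards against a single author's typo being repeated many times; the resulting list is still only a set of candidates for a human editor to review.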
Abandoned projects or works in progress
Please contact me if you speak one of these languages and would be willing to help.
- Balochi, with Mostafa Daneshvar.
- Chechen, with Sarah Slye, Steve Massey, et al.
- Chhattisgarhi, with Ravishankar Shrivastava.
- Cornish, with Edi Werner and Paul Bowden.
- Dzongkha, with Tshering Cigay Dorji.
- Guaraní, with Iván Prieto Corvalán.
- Hausa, with Mustapha Abubakar.
- Igbo, with Chinedu Uchechukwu and Ogechi Nnadi.
- Itzgründisch, with Sabine Emmy Eller.
- Kapampangan, with José Navarro.
- Kikongo, with Anderson Sunda-Meya.
- Limburgish, with Kenneth Rohde Christiansen.
- Luganda, with San Emmanuel James and Jackson Ssekiryango.
- Marshallese, with Marco Mora.
- Nawat, with Alan King.
- Papiamento, with Peter M. Damiana.
- Samoan, with Chris Bickers.
- Sindhi, with Abdul Rahim Nizamani.
- Sundanese, with Mang Jamal.
- Tahitian, with Christin Livine.
- Tigrinya and Tigré, with Merhawie Woldezion.
- Tongan, with Brian Romanowski.
- Yoruba, with Tope Faro.
Abandoned projects, word lists now available elsewhere
- Asturian, with Ricardo Mones Lastra (update: OpenOffice.org extension available here).
- Basque, with Alberto Fernández (update: hunspell package now available from euskadi.net).
- Bislama, with Eric Brandell (update: GPL word lists available from swtech.com.au).
- Friulian, with Andrea Tami (update: extensive word list now available from digilander.libero.it).
Improved word lists
The Swahili corpus was used to enhance the word list originally created by Jason Githeko of Egerton University, Njoro, Kenya. Read the Press Release.
I have also provided frequency lists to the teams developing the Armenian and Kazakh spell checkers.
Data for Northern Sámi were provided to the Divvun project, who are developing open source morphological analyzers, spell checkers, and hyphenators.
I am hoping to use the Quechua corpora to help the Runasimipi project, which is dedicated to the development of language technology and localized software for speakers of Quechua. We will need some help to train the language recognizer to distinguish the many dialects found on the web; please contact me if you're interested in helping.
I provided frequency lists that underlie the open source spell checkers for Low Saxon (Heiko Evermann), Luxembourgeois (Michel Weimerskirch), Shona (Boniface Manyame), and Tamil (Elanjelian Venugopal et al.).
Jacob Sparre Andersen has a powerful email-based editing system in place for Faroese. We've recently succeeded in arranging things so that our programs work together nicely: candidate words are extracted by An Crúbadán from the Faroese corpus using the techniques described below; these are sent automatically to Jacob's server, which prioritizes them and sends them via email to volunteer editors. Each night An Crúbadán can then download the modified word lists (of verified-correct and verified-incorrect words), which are in turn used to improve the crawler's language model, allowing more documents to be harvested and new words to be suggested, and so the cycle repeats. If you'd like to set up a similar system for your language using the An Crúbadán corpora, first download and install Jacob's "speling.org" system, available from his site, and then contact me.
Other Projects
I've provided data to hundreds of researchers working on computational or pure linguistics research projects for many languages. I used to track them all here but that's become more trouble than it's worth. Here is a list of some of the applications in any case:
- Dialect discrimination
- Computational Morphology
- Syntactic analysis (computational and theoretical)
- Lemmatization and IR
- Language ID
- Optical Character Recognition
- Sociolinguistics and social media
- Language learning
- Word sense disambiguation
- Predictive text
- Comparative phonology
- Lexicography
- Machine translation
- Selectional preferences
- Crossword generation
- POS tagging
- Speech recognition
- Speech synthesis
- Psycholinguistics
- Semantic networks
- Diacritic restoration
- Spell checking
Please contact me (kscanne at gmail dot com) if you are interested in applying these techniques to a new language.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
