Corpus building for minority languages:

Grammar Checking

The 100 million+ words of Irish downloaded by the crawler have been instrumental in the development of my grammar checker An Gramadóir, and many other NLP applications for Irish. Other language groups have used the Crúbadán data to develop grammar checkers, including for Afrikaans (Petri Jooste), Breton (Thierry Vignaud), Cornish (Edi Werner), Occitan (Bruno Gallart), and Walloon (Pablo Saratxaga).

Diacritic Restoration

Using Unicode, it is possible to create electronic documents in most languages with all of the proper diacritical marks and extended characters. Nevertheless, for various reasons, speakers of many languages do not do this when writing emails or blogs or producing other documents for consumption on the web. In Lingala, for example, there are tone markings as well as two extended vowels, the "open e" (ɛ) and the "open o" (ɔ). On the web, tone marks are generally omitted and these vowels are written as "e" and "o" respectively, so "abɔkɔ́lɛ́kɛ́" becomes "abokoleke". This can cause ambiguities for people reading Lingala texts, and also limits the usefulness of web texts for statistical purposes. To improve this situation, I wrote a script called "charlifter" that performs statistical diacritic restoration on web texts. This greatly enhances the usefulness of the Crúbadán corpora. The charlifter is also an application of the web crawler, in that the statistical language models it uses are created from the (rare) texts found by the crawler that use diacritical marks and extended characters correctly. My student Michael Schade has written a Firefox add-on that allows one to use this technology anywhere on the web: email, blogs, chats, etc.
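To make the idea concrete, here is a minimal sketch of one simple way statistical diacritic restoration can work. It is not the charlifter algorithm, which is not described here; it is a word-level lookup that maps each plain form to the marked form seen most often in correctly written training text. The names EXTENDED_TO_ASCII, strip_diacritics, train and restore are hypothetical and exist only for this illustration.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: extended vowels that are separate letters rather than
# base letter + combining mark need an explicit fallback table.
EXTENDED_TO_ASCII = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}


def strip_diacritics(word):
    """Reduce a word to the plain form typically typed on the web."""
    decomposed = unicodedata.normalize("NFD", word)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTENDED_TO_ASCII.get(ch, ch) for ch in no_marks)


def train(correct_text):
    """Map each plain form to its most frequent correctly marked form."""
    counts = defaultdict(Counter)
    for word in correct_text.split():
        word = unicodedata.normalize("NFC", word)
        counts[strip_diacritics(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(text, model):
    """Replace each known plain word; leave unknown words untouched."""
    return " ".join(model.get(word, word) for word in text.split())
```

A model trained this way on text containing "abɔkɔ́lɛ́kɛ́" would restore "abokoleke" to the marked form. A realistic system would also need to use context to choose between competing restorations and to handle capitalization, punctuation, and unseen words, none of which this sketch attempts.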

Language Recognition

The n-gram statistics gathered from the corpora for each language provide a powerful and effective language recognition algorithm. Of course particular care must be given to language pairs with very similar n-gram profiles; see the Language Similarity Table for more on this. The full corpus of Crúbadán 3-grams is now available (under the GPLv3) as part of the Natural Language Toolkit (NLTK); see their corpus download page.
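As an illustration of how n-gram statistics can drive language recognition, the following is a minimal sketch that builds character 3-gram profiles and picks the language whose profile is closest by cosine similarity. The choice of cosine similarity, the padding with spaces, and the function names trigram_profile, cosine and identify are assumptions made for this example; the method actually used by the crawler is not specified here.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of character 3-grams, with word boundaries as spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Return the language code whose profile best matches the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))
```

Here profiles would be a dictionary from language code to a profile built from that language's corpus with trigram_profile. With very short inputs or closely related languages, the scores for two profiles can be nearly equal, which is where the extra care mentioned above comes in.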

Hearing Testing

I've provided Mongolian corpus data to Drs. Richard Harris and Shawn Nissen at Brigham Young University for their project "Development of digitally recorded Mongolian Speech Audiometry Materials", the aim of which is to produce low-cost hearing tests for Mongolian speakers. I have also provided material for work in progress on Samoan, Tagalog, and several South African languages. Two master's theses have been based on this work.


Dasher is a free software package developed at the University of Cambridge that allows efficient text-entry without a keyboard. It uses a language model trained on text corpora to help it make predictions; the Dasher developers are training 129 new language models using the An Crúbadán corpora.


The Welsh corpus is being used by the University of Wales Dictionary of the Welsh Language (Geiriadur Prifysgol Cymru). There is a mention of the corpus work here. Other lexicographical projects I've helped:

Spell Checking

New word lists

I've written a series of statistical filters that can be applied to the web corpora to generate word lists that speed the process of developing a new spell checker. These techniques have been applied to create the following spell checking packages:

Abandoned projects or works in progress

Please contact me if you speak one of these languages and would be willing to help.

Abandoned projects, word lists now available elsewhere

Improved word lists

The Swahili corpus was used to enhance the word list originally created by Jason Githeko of Egerton University, Njoro, Kenya. Read the Press Release.

I have also provided frequency lists to the teams developing the Armenian and Kazakh spell checkers.

Data for Northern Sámi were provided to the Divvun project, who are developing open source morphological analyzers, spell checkers, and hyphenators.

I am hoping to use the Quechua corpora to help the Runasimipi project, which is dedicated to the development of language technology and localized software for speakers of Quechua. We will need some help to train the language recognizer to distinguish the many dialects found on the web; please contact me if you're interested in helping.

I provided frequency lists that underlie the open source spell checkers for Low Saxon (Heiko Evermann), Luxembourgish (Michel Weimerskirch), Shona (Boniface Manyame), and Tamil (Elanjelian Venugopal et al.).

Jacob Sparre Andersen has a powerful email-based editing system in place for Faroese. We've recently succeeded in arranging things so that our programs work together nicely: candidate words are extracted by An Crúbadán from the Faroese corpus using the techniques described below; these are sent automatically to Jacob's server, which prioritizes them and sends them via email to volunteer editors. Each night An Crúbadán can then download the modified word lists (of verified-correct and verified-incorrect words), which are in turn used to improve the crawler's language model, allowing more documents to be harvested, new words to be suggested, and so on. If you'd like to set up a similar system for your language using the An Crúbadán corpora, first download and install Jacob's system, available from his site, and then contact me.
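The candidate-extraction step of this loop can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name candidate_words, the min_count threshold, and the frequency-based ordering are hypothetical, and the actual filters used by An Crúbadán and the prioritization done on Jacob's server are not shown here.

```python
from collections import Counter


def candidate_words(corpus_tokens, verified_correct, verified_incorrect, min_count=2):
    """List corpus words not yet reviewed by editors, most frequent first.

    min_count is an assumed threshold that drops very rare tokens,
    which are more likely to be typos or foreign words.
    """
    counts = Counter(token.lower() for token in corpus_tokens)
    reviewed = set(verified_correct) | set(verified_incorrect)
    candidates = [
        (word, count)
        for word, count in counts.items()
        if word not in reviewed and count >= min_count
    ]
    # Sort by descending frequency, then alphabetically for a stable order.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates]
```

Each nightly run would call this with the freshly downloaded verified lists, so words that editors have already accepted or rejected are never proposed again.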

Other Projects

I've provided data to hundreds of researchers working on computational or pure linguistics research projects for many languages. I used to track them all here but that's become more trouble than it's worth. Here is a list of some of the applications in any case:

Please contact me (kscanne at gmail dot com) if you are interested in applying these techniques to a new language.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.