Corpus building for minority languages

Kevin P. Scannell


This page describes my web crawling software, An Crúbadán. I've written it for researchers in natural language processing (NLP) interested in corpus development, and also for webmasters who have seen the software show up in their logs. In short, the goal of the software is the automatic development of large text corpora for under-resourced languages.

Statistical techniques are a key part of most modern natural language processing systems. Unfortunately, such techniques require large bodies of text, and in the past corpus development has proved to be quite expensive. As a result, substantial corpora exist primarily for languages such as English, French, and German, where there is a market-driven need for NLP tools.

My software is designed to exploit the vast quantities of text freely available on the web as a way of bringing the benefits of statistical NLP to languages with small numbers of speakers and/or limited computational resources. Initially it was deployed for the six Celtic languages, but more recently I've added support for a number of other languages from all parts of the world. You can find an up-to-date list of languages and the corpus statistics for each on the Status Page. There is also information on tools developed using these corpora on the Applications Page.

I gave a presentation on this work at the WAC3 conference in Louvain-la-Neuve, Belgium in September of 2007. Here is the conference paper, which is the one to cite if you make use of this work: The Crúbadán Project: Corpus building for under-resourced languages. I am grateful to the organizers Cédrick Fairon, Adam Kilgarriff, and Gilles-Maurice de Schryver for the invitation and to the Université Catholique de Louvain for financial support that made the trip possible. You can read the slides from my main talk and also some remarks I made during the WAC panel discussion.

The remarkable image to the left was created by Michael Cysouw of the Max Planck Institut für evolutionäre Anthropologie in Leipzig. It illustrates the tree of language relationships as reconstructed from the 3-gram table using the neighbor-joining algorithm of Saitou and Nei. Several language families are reproduced quite accurately, despite the naïveté of the approach. Here's an updated version of the image generated in December 2011, showing 1000 languages.
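
For readers unfamiliar with neighbor-joining, the following Python sketch shows the core of the Saitou and Nei procedure on a toy distance matrix. It is not the code used to produce the image: the function name, the labels a to e, and the distance values are all hypothetical, and how distances were derived from the 3-gram table is not specified on this page. Only the tree topology is returned; branch lengths are omitted for brevity.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Build an unrooted tree topology by neighbor-joining.

    labels: list of leaf names.
    dist:   dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the topology as nested tuples (no branch lengths).
    """
    nodes = list(labels)
    d = dict(dist)  # copy, so the caller's matrix is left untouched
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to every other node.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Pick the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        u = (i, j)  # the new internal node joining i and j
        rest = [k for k in nodes if k not in (i, j)]
        # Distance from the new node to each remaining node.
        for k in rest:
            d[frozenset((u, k))] = 0.5 * (
                d[frozenset((i, k))]
                + d[frozenset((j, k))]
                - d[frozenset((i, j))]
            )
        nodes = rest + [u]
    return tuple(nodes)


# Hypothetical distances between five items, for illustration only.
toy = {
    frozenset("ab"): 5, frozenset("ac"): 9, frozenset("ad"): 9,
    frozenset("ae"): 8, frozenset("bc"): 10, frozenset("bd"): 10,
    frozenset("be"): 9, frozenset("cd"): 8, frozenset("ce"): 7,
    frozenset("de"): 3,
}
tree = neighbor_joining(list("abcde"), toy)
print(tree)
```

In this toy matrix, a and b end up as neighbors, as do d and e. With real data, each leaf would be a language and each distance would come from comparing the 3-gram statistics of two languages.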

The word crúbadán literally means "crawler" in Irish, but with the additional (and, in this context, appropriate) connotation of unwanted or clumsy "pawing", from the root crúb ("paw"). Several people have asked me how it is pronounced; you can now listen to the word as it's spoken by the wonderful Irish speech synthesizer.


For copyright reasons, I am not allowed to make the full-text corpora available for download from this site. But send me an email if you're interested in a particular language; there's plenty of data I am free to share (frequency lists, n-grams, etc.).

For anyone interested in language identification, I've made the full corpus of character 3-grams available under the GPLv3. See the list of NLTK corpora.
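
As a rough illustration of how character 3-grams can be used for language identification, here is a minimal Python sketch. It is not the code used by An Crúbadán and does not read the released 3-gram files: the function names, the cosine-similarity scoring, and the two tiny training samples are assumptions made for the example.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Return the relative frequency of each character 3-gram in text."""
    # Lowercase, collapse whitespace, and pad so word edges produce 3-grams.
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language code whose profile is most similar to text."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


# Toy training data; a real profile would be built from a large corpus.
samples = {
    "ga": "Tá an aimsir go hálainn inniu agus tá an ghrian ag taitneamh",
    "en": "The weather is lovely today and the sun is shining",
}
profiles = {lang: trigram_profile(text) for lang, text in samples.items()}

print(identify("tá an lá go maith", profiles))
print(identify("the sun is lovely", profiles))
```

With profiles built from a full corpus rather than one sentence per language, the same comparison extends to as many languages as there are profiles.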



Many individuals have contributed their linguistic expertise to this project by helping me "tune" individual language models. A complete list of contributors can be found on the status page, under the heading "Contact(s)". I am grateful to you all for your generous help and unflagging enthusiasm for this project!

Thanks also to Patric Müller for his help in getting vilistextum to process UTF-8 correctly, and to Kevin Atkinson for a number of useful suggestions. If you're interested in Ethiopic scripts, Biniam Gebremichael has implemented a similar system for Tigrigna and Amharic.

This material is based upon work supported by the National Science Foundation under Grant Number 1159174. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.