Corpus building for minority languages
Kevin P. Scannell
Summary
This page contains a description of my web crawling software An Crúbadán. I've written this page for researchers in natural language processing (NLP) interested in corpus development and also for webmasters of sites who have seen the software show up in their logs. In short, the goal of the software is the automatic development of large text corpora for minority languages.
Statistical techniques are a key part of most modern natural language processing systems. Unfortunately, such techniques require the existence of large bodies of text, and in the past corpus development has proved to be quite expensive. As a result, substantial corpora exist primarily for languages such as English, French, and German, where there is a market-driven need for NLP tools.
My software is designed to exploit the vast quantities of text freely available on the web as a way of bringing the benefits of statistical NLP to languages with small numbers of speakers and/or limited computational resources. Initially it was deployed for the six Celtic languages, but more recently I've added support for a number of other languages from all parts of the world. You can find an up-to-date list of languages and the corpus statistics for each on the Status Page. There is also information on tools developed using these corpora on the Applications Page.
I gave a presentation on this work at the WAC3 conference in Louvain-la-Neuve, Belgium in September of 2007. I am grateful to the organizers Cédrick Fairon, Adam Kilgarriff, and Gilles-Maurice de Schryver for the invitation and to the Université Catholique de Louvain for financial support that made the trip possible. You can read the slides from my main talk and also some remarks I made during the WAC panel discussion.
The remarkable image to the left was created by Michael Cysouw of the Max Planck Institut für evolutionäre Anthropologie in Leipzig. It illustrates the tree of language relationships as reconstructed from the 3-gram table using the neighbor-joining algorithm of Saitou and Nei. Several language families are reproduced quite accurately, despite the naïveté of the approach.
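To give a rough idea of the input a tree-building method like neighbor-joining needs, here is a minimal Python sketch that turns one text sample per language into a pairwise distance matrix. It rests on assumptions not stated above: that the 3-grams are character trigrams, and that cosine distance between trigram counts is an acceptable measure. The function names are hypothetical, and this is not necessarily how the image was produced.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count character 3-grams in lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_distance(p, q):
    """Return 1 minus the cosine similarity of two trigram count tables."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(samples):
    """Map {language name: sample text} to (sorted names, square distance matrix)."""
    names = sorted(samples)
    profiles = {name: trigram_profile(samples[name]) for name in names}
    matrix = [[cosine_distance(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix
```

A neighbor-joining implementation would then take `names` and `matrix` as its input and repeatedly merge the closest pair of clusters to build the tree.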
The word crúbadán means literally "crawler" in Irish, but with the additional (appropriate in this context) connotation of unwanted or clumsy "pawing", from the root crúb ("paw").
Webmasters
- The downloaded content will be used for academic purposes only and will not be republished in any form.
- The software respects robots.txt.
- Any given file should be downloaded just once.
- The software is designed to restrict downloads to sites having appropriate content in minority languages. If you are seeing many hits in your web logs and your site has, say, only English content, please let me know.
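For readers curious what the robots.txt and download-once rules can look like in code, here is a small Python sketch using the standard library's `urllib.robotparser`. It is an illustration only, not the crawler's actual code. The user-agent name `ExampleCrawler` and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical name, not the crawler's real user-agent


def make_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_download(url, robots, seen):
    """Allow a URL only if robots.txt permits it and it was not fetched before."""
    if url in seen or not robots.can_fetch(USER_AGENT, url):
        return False
    seen.add(url)  # remember the URL so it is downloaded at most once
    return True
```

In this sketch the caller keeps one `seen` set for the whole crawl and one parsed robots.txt per site.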
Thanks
Many individuals have contributed their linguistic expertise to this project by helping me "tune" individual language models. A complete list of contributors can be found on the status page, under the heading "Contact(s)". I am grateful to you all for your generous help and unflagging enthusiasm for this project!
Thanks also to Patric Müller for his help in getting vilistextum to process UTF-8 correctly, and to Kevin Atkinson for a number of useful suggestions. If you're interested in Ethiopic scripts, Biniam Gebremichael has managed to implement a similar system for Tigrigna and Amharic.
How it works
Initially a small collection of "seed" texts is fed to the crawler (a few hundred words of running text have been sufficient in practice). Queries combining words from these texts are generated and passed to the Google API, which returns a list of documents potentially written in the target language. These are downloaded, processed into plain text, and formatted. A combination of statistical techniques, bootstrapped from the initial seed texts and refined as more texts are added to the database, is used to determine which documents (or sections thereof) are written in the target language. The crawler then recursively follows links contained within documents that are in the target language. When these run out, the entire process is repeated, with a new set of Google queries generated from the new, larger corpus.
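The loop described above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation. The callables `search`, `fetch`, and `in_target_language` are hypothetical stand-ins for the search API, the download-and-convert step, and the statistical language filter, and the query sizes are arbitrary.

```python
import random
from collections import deque


def make_queries(corpus_words, n_queries, words_per_query, rng):
    """Build search queries by combining words sampled from the corpus."""
    vocab = sorted(set(corpus_words))
    k = min(words_per_query, len(vocab))
    return [" ".join(rng.sample(vocab, k)) for _ in range(n_queries)]


def crawl(seed_text, search, fetch, in_target_language, rounds=2, rng=None):
    """Bootstrap a corpus from seed text.

    search(query) -> list of URLs                (stand-in for a search API)
    fetch(url) -> (plain_text, links)            (download and convert to text)
    in_target_language(text, corpus) -> bool     (language filter)
    All three are hypothetical callables supplied by the caller.
    """
    rng = rng or random.Random(0)
    corpus = [seed_text]
    seen = set()
    for _ in range(rounds):
        # Generate a fresh set of queries from the current, larger corpus.
        words = " ".join(corpus).split()
        queue = deque()
        for query in make_queries(words, 5, 2, rng):
            queue.extend(search(query))
        # Process candidates until the queue of unseen URLs runs out.
        while queue:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            text, links = fetch(url)
            if in_target_language(text, corpus):
                corpus.append(text)
                queue.extend(links)  # follow links only from accepted documents
    return corpus, seen
```

Passing the three steps in as functions keeps the sketch runnable without network access: a test can supply a small in-memory "web" in place of real searches and downloads.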
© Cóipcheart/Copyright 2004 Kevin P. Scannell
