Corpus building for minority languages:
Applications
Grammar Checking
The 100 million+ words of Irish downloaded by the crawler have been instrumental in the development of my grammar checker An Gramadóir, and many other NLP applications for Irish. Other language groups have used the Crúbadán data for developing grammar checkers, including for Afrikaans (Petri Jooste), Breton (Thierry Vignaud), Cornish (Edi Werner), Occitan (Bruno Gallart), and Walloon (Pablo Saratxaga).
Diacritic Restoration
Using Unicode it is possible to create electronic documents in most languages with all of the proper diacritical marks and extended characters. Nevertheless, for various reasons speakers of many languages do not do this when writing emails or blogs or producing other documents for consumption on the web. In Lingala, for example, there are tone markings as well as two extended vowels, the "open e" (ɛ), and the "open o" (ɔ). On the web, tone marks are generally omitted and these vowels are written as "e" and "o" respectively, so "abɔkɔ́lɛ́kɛ́" becomes "abokoleke". This limits the usefulness of web texts for statistical purposes. To improve this situation, I wrote a script called "charlifter" that performs statistical diacritic restoration on web texts. This greatly enhances the usefulness of the Crúbadán corpora. The charlifter is also an application of the web crawler, in that the statistical language models it uses are created from the (rare) texts found by the crawler that use diacritical marks and extended characters correctly.
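As a rough illustration of the idea, the sketch below restores diacritics by looking up each plain word and returning the most frequent fully marked form seen in correctly written training text. This is not the charlifter code, whose internals are not described here; the function names (strip_marks, train, restore) and the word-level lookup strategy are assumptions made for the example.

```python
import unicodedata
from collections import Counter, defaultdict

# Extended letters with no Unicode decomposition must be mapped by hand.
# The two Lingala vowels are taken from the text; a real table would be
# language-specific (assumption).
EXTENDED = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}

def strip_marks(word):
    """Reduce a word to the plain form typically typed on the web."""
    decomposed = unicodedata.normalize("NFD", word)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(EXTENDED.get(c, c) for c in no_marks)

def train(words):
    """Map each plain form to counts of the marked forms seen in training."""
    model = defaultdict(Counter)
    for w in words:
        model[strip_marks(w)][unicodedata.normalize("NFC", w)] += 1
    return model

def restore(word, model):
    """Return the most frequent marked form, or the word itself if unseen."""
    candidates = model.get(strip_marks(word))
    if not candidates:
        return word
    return candidates.most_common(1)[0][0]

# Train on the (rare) correctly written text, then restore a plain word.
model = train(["abɔkɔ́lɛ́kɛ́"])
print(restore("abokoleke", model))
```

A word-level lookup like this cannot handle words missing from the training text, or plain forms whose correct restoration depends on context; a practical system would need some fallback for both cases.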
Language Recognition
The n-gram statistics gathered from the corpora for each language provide the basis for a simple and effective language recognition algorithm. Of course, particular care must be taken with language pairs that have very similar n-gram profiles; see the Language Similarity Table for more on this.
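One common way to turn n-gram statistics into a recognizer is sketched below: build a character trigram profile per language, then pick the language whose profile gives the input the highest smoothed log-probability. The scoring method (add-one smoothing), the function names, and the tiny training strings are assumptions for illustration, not a description of the recognizer used by the crawler.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_profile(text, n=3):
    """Count the n-grams in a sample of one language."""
    return Counter(char_ngrams(text, n))

def score(text, profile, vocab_size, n=3):
    """Log-probability of the text under a profile, with add-one smoothing."""
    total = sum(profile.values())
    return sum(
        math.log((profile[g] + 1) / (total + vocab_size))
        for g in char_ngrams(text, n)
    )

def identify(text, profiles, n=3):
    """Return the language code whose profile scores the text highest."""
    vocab = set()
    for p in profiles.values():
        vocab.update(p)
    return max(profiles, key=lambda lang: score(text, profiles[lang], len(vocab), n))

# Toy training data; real profiles would come from the full corpora.
profiles = {
    "ga": train_profile("tá an aimsir go maith inniu agus tá an ghrian ag taitneamh"),
    "cy": train_profile("mae'r tywydd yn braf heddiw ac mae'r haul yn tywynnu"),
}
print(identify("tá an ghrian go maith", profiles))
```

For closely related languages the top two scores will often be close, so a practical system might report the margin between them instead of a single label.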
Hearing Testing
I've provided Mongolian corpus data to Drs. Richard Harris and Shawn Nissen at Brigham Young University for their project "Development of Digitally Recorded Mongolian Speech Audiometry Materials", the aim of which is to produce low-cost hearing tests for Mongolian speakers. I provided additional material for work-in-progress on the Samoan language. See the Master's Theses based on this work here and here.
Dasher
Dasher is a free software package developed at the University of Cambridge that allows efficient text-entry without a keyboard. It uses a language model trained on text corpora to help it make predictions; the Dasher developers are training 129 new language models using the An Crúbadán corpora.
Machine Translation
This is one of my primary interests. I've supplied web-crawled monolingual and parallel corpus data to several groups working on machine translation involving at least one minority or under-resourced language:
- Breton (Fran Tyers, Apertium project)
- Faroese (Fran Tyers, Apertium project)
- Indonesian (Ferli Deni Iskandar)
- Irish (Andy Way, Dublin City University)
- Maori (Richard A. O'Keefe, University of Otago)
- Nahuatl (Rada Mihalcea, University of North Texas)
- Oromo (Sisay Adugna, Addis Ababa University)
- Quechua (Rada Mihalcea, University of North Texas)
- Scottish Gaelic (Fran Tyers, Apertium project)
- Swahili (Gabriele Maria Brandolini)
Lexicography
The Welsh corpus is being used by the University of Wales Dictionary of the Welsh Language (Geiriadur Prifysgol Cymru). There is a mention of the corpus work here. Other lexicographical projects I've helped:
- focloir.ie, the new English-Irish dictionary project
- kasahorow.org
- Is Iomaí Duine ag Dia, Dennis King et al
- Loig Cheveau (Irish dictionary)
- SketchEngine
Other Projects
Assorted other projects in computational linguistics:
- Houssein Ahmen, University of Bordeaux (Somali POS tagging)
- Max Bane, University of Chicago (morphological complexity of creole languages)
- Erwin Chan, University of Pennsylvania (morphological induction for Galician)
- Cris Daniluk (language recognition)
- Dmitry Davidov, Hebrew University in Jerusalem (unsupervised learning of semantic categories)
- Guy De Pauw et al, University of Antwerp (statistical diacritic replacement)
- Chris Harvey, Indigenous Languages Institute (data for Ojibway and Cree)
- Baden Hughes, University of Melbourne (Tok Pisin)
- Terry Martin, Queensland University of Technology (Indonesian)
- Michal Boleslav Měchura, Dublin City University (study of selectional preferences)
- Muhirwe Jackson, Makerere University (morphological analysis of Kinyarwanda)
- Seosamh Mac Muirí, University of Limerick (Irish frequency lists)
- Alexandre Papadopoulos, University College Cork, and Raphael Finkel, University of Kentucky (crossword generation)
- Hariharan Ramamurthy (Telugu speech recognition)
- Bríd Stack et al, teacs.ie (Irish predictive text)
- Pranava Swaroop, Indian Institute of Science (Tamil and Kannada taggers)
- Toma Tasovac, Princeton University (Serbian semantic network)
- Isabella Ties, European Academy (Ladin)
- Trond Trosterud, Universitetet i Tromsø (Kalaallisut parsing)
- Joshua Verano, Cebu Institute of Technology (Cebuano NLP tools)
- Pauline Welby, CNRS Université de Provence (Irish initial mutations)
- Friedel Wolff, translate.org.za (language recognition data for South African languages)
Spell Checking
New word lists
I've written a series of statistical filters that can be applied to the web corpora to generate word lists that speed the process of developing a new spell checker. These techniques have been applied to create the following spell checking packages:
- hunspell-as (Assamese). Joint work with Amitakhya Phukan.
- aspell-az, hunspell-az (Azerbaijani). Joint work with Metin Amiroff.
- aspell-csb, hunspell-csb (Kashubian). Joint work with Roman Drzeżdżon and Piotr Formella.
- aspell-fy, ispell-fy, hunspell-fy (Frisian). Joint work with Eeltje de Vries.
- aspell-ga, ispell-ga, hunspell-ga (Irish).
- aspell-gd, ispell-gd, hunspell-gd (Scottish Gaelic).
- aspell-gv, ispell-gv (Manx Gaelic). Using earlier work of Alastair McKinstry.
- hunspell-haw (Hawaiian).
- aspell-hil, ispell-hil (Hiligaynon). Joint work with Francis Dimzon.
- hunspell-ht (Haitian Creole). Joint work with Jean Came Poulard and LogiPam.
- aspell-ku, ispell-ku, hunspell-ku (Kurdish). Joint work with Erdal Ronahi and Rêzan Tovjîn.
- aspell-ky (Kirghiz). Joint work with Ilyas Bakirov.
- hunspell-ln (Lingala). Joint work with Denis Jacquerye.
- aspell-mg, ispell-mg, hunspell-mg (Malagasy). Joint work with Rado Ramarotafika.
- aspell-mn (Mongolian). Joint work with Sanlig Badral. See the announcement (in Mongolian). I'm "Профессор Доктор Кэвин Сканнелл".
- aspell-ny, ispell-ny, hunspell-ny (Chichewa). Joint work with Soyapi Mumba and Edmond Kachale.
- hunspell-om (Oromo). Joint work with Belayneh Melka and Dawit Boka.
- aspell-rw, ispell-rw, hunspell-rw (Kinyarwanda). Joint work with Steve Murphy and Philibert Ndandali.
- hunspell-son (Songhay). Joint work with Abdoul Cisse and Mohomodou Houssouba.
- hunspell-so (Somali). Joint work with Mohamed I. Mursal. Packaged as a Mozilla add-on. See the announcement (English).
- aspell-tet, ispell-tet, hunspell-tet (Tetum). Joint work with Peter Gossner.
- aspell-tk, hunspell-tk (Turkmen). Joint work with Jumamurat Bayjan.
- aspell-tl, ispell-tl, hunspell-tl (Tagalog). Joint work with Ramil Sagum.
- aspell-tn, ispell-tn, hunspell-tn (Setswana). Joint work with Thapelo Otlogetswe.
Abandoned projects or works in progress
Please contact me if you speak one of these languages and would be willing to help.
- Asturian, with Ricardo Mones Lastra, Marcos Costales.
- Balochi, with Mostafa Daneshvar.
- Bislama, with Eric Brandell.
- Bosnian, with Eldar Murselovic.
- Chhattisgarhi, with Ravishankar Shrivastava.
- Cornish, with Edi Werner and Paul Bowden.
- Diola, with Outi Sane.
- Dzongkha, with Tshering Cigay Dorji.
- Guaraní, with Iván Prieto Corvalán.
- Hausa, with Mustapha Abubakar.
- Igbo, with Chinedu Uchechukwu and Ogechi Nnadi.
- Itzgründisch, with Sabine Emmy Eller.
- Kikongo, with Anderson Sunda-Meya.
- Limburgish, with Kenneth Rohde Christiansen.
- Luganda, with San Emmanuel James and Jackson Ssekiryango.
- Papiamento, with Peter M. Damiana.
- Samoan, with Dr. Hans Zarkov, formerly of NASA.
- Secwepemc, with Neskie Manuel.
- Sundanese, with Mang Jamal.
- Tahitian, with Christin Livine.
- Tigrinya and Tigré, with Merhawie Woldezion.
- Tongan, with Brian Romanowski.
- Yoruba, with Tope Faro.
Abandoned projects, word lists now available elsewhere
- Basque, with Alberto Fernández (update: hunspell package now available from euskadi.net).
- Friulian, with Andrea Tami (update: extensive word list now available from digilander.libero.it).
Improved word lists
The Swahili corpus was used to enhance the word list originally created by Jason Githeko of Egerton University, Njoro, Kenya. Read the Press Release.
I have also provided frequency lists to the teams developing the Armenian and Kazakh spell checkers.
Data for Northern Sámi were provided to the Divvun project, who are developing open source morphological analyzers, spell checkers, and hyphenators.
I am hoping to use the Quechua corpora to help the Runasimipi project, which is dedicated to the development of language technology and localized software for speakers of Quechua. We will need some help to train the language recognizer to distinguish the many dialects found on the web; please contact me if you're interested in helping.
I provided frequency lists that underlie the open source spell checkers for Low Saxon (Heiko Evermann) and Luxembourgeois (Michel Weimerskirch).
Jacob Sparre Andersen has a powerful email-based editing system in place for Faroese. We've recently succeeded in arranging things so that our programs work together nicely: candidate words are extracted by An Crúbadán from the Faroese corpus using the techniques described below; these are sent automatically to Jacob's server, which prioritizes them and sends them via email to volunteer editors. Each night An Crúbadán can then download the modified word lists (of verified-correct and verified-incorrect words), which are in turn used to improve the crawler's language model, allowing more documents to be harvested, new words to be suggested, and so on. If you'd like to set up a similar system for your language using the An Crúbadán corpora, first download and install Jacob's "speling.org" system, available from his site, and then contact me.
How it works
First, statistics measuring co-occurrence with the highest frequency words in the target language are used to filter out sections written in other languages or containing mostly noise (e.g. computer code, tabular data, etc.). The remaining text is tokenized and used to generate a word list sorted by frequency, and the lowest frequency words are filtered out. Then, depending on the target language, correctly-spelled words from one or more "polluting" languages are set aside to be checked by hand later. Usually this means English, but I also filter Dutch from the Frisian corpus, Spanish from Basque, etc. The remaining words are used to generate 3-gram statistics for the target language. These are used to flag as "suspect" any remaining words containing one or more improbable 3-grams. Additional filters specific to certain languages can be applied optionally; for instance, pairs of words differing only in the presence or absence of diacritical marks can be flagged, as can words with a capital letter appearing after the first letter, words with no vowels, etc.
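The middle steps of this process (frequency cutoff, removal of polluting-language words, and 3-gram flagging) can be sketched roughly as follows. This is a minimal illustration and not the actual filter code: the function names, the thresholds, the choice to count each 3-gram once per distinct word, and the use of a raw count as the test for "improbable" are all assumptions. The initial section-level language filter and the optional language-specific filters are left out.

```python
from collections import Counter

def trigrams(word):
    """Character 3-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_word_list(tokens, polluting_words, min_word_freq=2, min_trigram_count=2):
    """Split corpus tokens into accepted, suspect, and set-aside word lists.

    tokens            -- words from text already judged to be in the target language
    polluting_words   -- set of correctly spelled words from other languages
    min_word_freq     -- words seen fewer times than this are dropped (assumed value)
    min_trigram_count -- 3-grams found in fewer distinct words than this
                         count as improbable (assumed value)
    """
    freq = Counter(t.lower() for t in tokens)

    # Step 1: drop the lowest frequency words.
    frequent = [w for w, c in freq.items() if c >= min_word_freq]

    # Step 2: set polluting-language words aside for checking by hand.
    set_aside = [w for w in frequent if w in polluting_words]
    remaining = [w for w in frequent if w not in polluting_words]

    # Step 3: gather 3-gram statistics from the remaining words.
    tri_counts = Counter(g for w in remaining for g in trigrams(w))

    # Step 4: flag any word containing at least one improbable 3-gram.
    accepted, suspect = [], []
    for w in remaining:
        if any(tri_counts[g] < min_trigram_count for g in trigrams(w)):
            suspect.append(w)
        else:
            accepted.append(w)
    return accepted, suspect, set_aside
```

With a corpus of realistic size, thresholds would normally be tuned per language, and comparing 3-gram probabilities instead of raw counts would make the cutoff less sensitive to corpus size.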
Please contact me (kscanne at gmail dot com) if you are interested in applying these techniques to a new language.
© Cóipcheart/Copyright 2004 Kevin P. Scannell
