Natural Language Processing
Lectures, Papers, and Conferences
- Standardization of corpus texts for the New English-Irish Dictionary, paper presented at the 15th annual NAACLT conference, New York, May 22, 2009. (PDF)
- Statistical language processing (and Perl), survey talk for the St. Louis Perl Mongers, January 21, 2009. (PDF)
- I am on the program committee for the 5th Web as Corpus workshop in San Sebastián, Spain, September 7th, 2009.
- Free software for indigenous languages, to appear in Native Language Network, a publication of the Indigenous Language Institute. (PDF)
- Semi-automated construction of semantic networks using web corpora, paper presented at the "Words, Texts and Dictionaries" conference, University of Wales Centre for Advanced Welsh and Celtic Studies, Aberystwyth, October 18, 2008. (ODP, PDF)
- Open source language technology using statistical methods, survey talk for the St. Louis Unix Users Group, August 21, 2008. (ODP, PDF)
- An Gramadóir: A grammar-checking framework for the Celtic languages and its applications, paper presented at the 14th annual NAACLT conference, Madog Center for Welsh Studies, University of Rio Grande, June 12-13, 2008. (PDF)
- I am on the program committee for the 4th Web as Corpus workshop in Marrakech, Morocco, June 1st, 2008.
- Language technology from scratch. Video lecture at the Vox Humanitatis conference on technology for minority languages held as part of the 41st Festival of the Piedmont Region, Cherasco, Italy, May 2008. (PDF Transcript)
- I was an invited speaker and served on the program committee for the 3rd Web as Corpus Workshop (WAC3) in Louvain-la-Neuve, Belgium, September 15-16, 2007. (Main talk: ODP, PDF. Panel discussion: ODP, PDF)
- The Crúbadán Project: Corpus building for under-resourced languages, Cahiers du Cental 4 (2007), pp. 5-15, C. Fairon, H. Naets, A. Kilgarriff, G-M de Schryver, eds., "Building and Exploring Web Corpora", Proceedings of the 3rd Web as Corpus Workshop in Louvain-la-Neuve, Belgium, September 2007. (PDF)
- Translations of free software into Irish (with Séamus Ó Ciardhuáin), Translation Ireland 17 (2007), no. 2, 19-30. Special issue on "Translation and Irish in the 21st Century". (PDF)
- Implementing NLP Projects for Non-Central Languages: Instructions for Funding Bodies, Strategies for Developers (with Oliver Streiter and Mathias Stuflesser), Machine Translation 20 (2006), no. 4, 267-289. (PDF)
- Machine translation for closely related language pairs, Proceedings of the Workshop "Strategies for developing machine translation for minority languages" at LREC 2006, Genoa, Italy, May 2006, pp. 103-107. (PDF)
- I was on the program committee for the workshop TAL et langues peu dotées at TALN 2005 in Dourdan, France, June 6-10, 2005.
- I was also on the program committee for a similar workshop that was held at the ACL meeting in Ann Arbor, June 29-30, 2005: Building and using parallel texts for languages with scarce resources.
- Applications of parallel corpora to the development of monolingual language technologies. (PDF)
- Automatic thesaurus generation for minority languages: an Irish example, Actes de la 10e conférence TALN à Batz-sur-Mer du 11 au 14 Juin 2003, volume 2, pp. 203-212. Paper presented at the workshop Traitement Automatique des Langues Minoritaires et des Petites Langues. (PDF)
- Hyphenation patterns for minority languages, TUGboat 24 (2003), no. 2, 236-239. (PDF)
- Global Software, lecture on internationalization for undergraduates at Saint Louis University, April 8th, 2002. (PDF Slides)
Software Projects
I have written and actively maintain a number of open source software packages in support of minority languages and other languages with limited computational resources.
Corpora, web-crawling, search engines
- aimsigh.com. Linguistically sophisticated search.
- An Crúbadán. A web crawler for building minority language corpora automatically.
- Corpas Comhthreomhar Gaeilge-Béarla. An aligned parallel corpus of Irish and English texts.
- Internet Corpus of Welsh. Contains approximately 100 million words of Welsh. Now in use by the University of Wales Welsh Dictionary.
- Other Corpora. Asturian, Aymara, Basque, Breton, ... Venda, Walloon, Zulu.
Spell Checking and Grammar Checking
- An Gramadóir. An open source grammar checking engine that works with vim, emacs, and OpenOffice.
- charlifter. A script that performs statistical diacritic restoration. Pre-trained models are available for a number of languages (Irish, Lingala, Hawaiian, Samoan, ...)
- GaelSpell. Irish spellcheckers for multiple platforms built from a single, high-quality database.
- Aspell. I'm using web crawling and statistical methods to develop new spell checking packages for a number of minority languages.
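
As a rough illustration of what "statistical diacritic restoration" can mean, here is a minimal word-level sketch in Python. It is not charlifter's actual method or interface: the function names, the unigram-frequency approach, and the tiny made-up training sentence are all assumptions chosen for illustration. The idea is to count, for each diacritic-free spelling, which accented form occurs most often in training text, and then substitute that form.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Return the word with combining marks removed (e.g. 'tá' -> 'ta')."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def train(corpus_tokens):
    """For each stripped spelling, count how often each accented form occurs."""
    model = defaultdict(Counter)
    for token in corpus_tokens:
        model[strip_diacritics(token)][token] += 1
    return model


def restore(text, model):
    """Replace each token with the most frequent form seen in training.

    Tokens whose stripped spelling never occurred in training are left unchanged.
    """
    out = []
    for token in text.split():
        candidates = model.get(strip_diacritics(token))
        out.append(candidates.most_common(1)[0][0] if candidates else token)
    return " ".join(out)


# Toy training data (hypothetical; a real model would be trained on a large corpus).
TRAINING = "tá an fear mór anseo agus tá an bhean óg ansin".split()
model = train(TRAINING)
restored = restore("ta an fear mor anseo", model)
print(restored)
```

A unigram model like this cannot choose between forms that differ only in their diacritics and are both common; a practical system would also need surrounding context and some fallback for unseen words.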
Lexicography
- Foclóir Nua Béarla-Gaeilge (The New English-Irish Dictionary). I am helping prepare Irish texts written in pre-standard orthography for indexing and inclusion in the project corpus.
- Líonra Séimeantach na Gaeilge. An Irish language semantic network ("WordNet"), available as a traditional thesaurus, or via a cool 3D browser.
- English-Irish-Afrikaans dictionary. Written with Darrin Speegle.
- Hyphenation. An Irish hyphenation dictionary adapted for use with TeX/LaTeX, Scribus, OpenOffice, etc.
Machine Translation
- ga2gd. Robust machine translation between closely related languages. See student projects below.
Human Translation
- GNU/Linux. Ever wonder how to say "in compatibility mode, the last two arguments must be offsets" in Irish? I am team leader at the GNU Translation Project.
- OpenOffice.org. I'm also coordinating the effort to translate OpenOffice.org into Irish.
- Mozilla. Localization of the Firefox web browser, Thunderbird email handler, and Sunbird calendar into Irish.
- KDE. Joint work with Séamus Ó Ciardhuáin.
Selected Student Projects
- Port of ga2gd to Apertium. Project by Sean Burke (University of Montana undergraduate) for the Google Summer of Code, 2009.
- Amharic-English Cross-Lingual Information Retrieval: A Corpus-Based Approach. By Aynalem Tesfaye, as part of her MS at Addis Ababa University, Department of Information Science, 2009.
- Language recognition and corpus building for Bosnian, Serbian, and Croatian. With Eldar Murselovic, SLU undergraduate.
- Bayesian grammar checker for Irish. Senior design project by Rich Barmeier.
- Example-based machine translation of open-source software packages. Senior design project by Regina Lennox, 2009.
- Who's afraid of the big bad Whorf? 4th-grade science fair project on linguistic relativity by my daughter Maddy, 2009.
- Port of ga2gd to Apertium, and development of Irish-Manx lexicon (ga2gv). Project by Joshua Glatt, Washington University undergraduate, 2008-2009.
- Translation recognition algorithm. Senior design project by Michael Henderson, 2008.
- Kinyarwanda morphology. Developed by Jackson Muhirwe as part of his Ph.D. work at Makerere University in Kampala, Uganda, 2007.
- Bosnian word-sense disambiguation. Senior design project by Jasmin Custic, 2007.