Natural Language Processing
Lectures, Papers, and Conferences
- Semi-automated construction of semantic networks using web corpora, paper presented at the "Words, Texts and Dictionaries" conference, University of Wales Centre for Advanced Welsh and Celtic Studies, Aberystwyth, October 18, 2008.
- Open source language technology using statistical methods, survey talk for the St. Louis Unix Users Group, August 21, 2008. (ODP, PDF)
- An Gramadóir: A grammar-checking framework for the Celtic languages and its applications, paper presented at the 14th annual NAACLT conference, Madog Center for Welsh Studies, University of Rio Grande, June 12-13, 2008. (PDF)
- I am on the program committee for the 4th Web as Corpus workshop in Marrakech, Morocco, June 1st, 2008.
- I was an invited speaker and served on the program committee for the 3rd Web as Corpus Workshop (WAC3) in Louvain-la-Neuve, Belgium, September 15-16, 2007. (Main talk: ODP, PDF. Panel discussion: ODP, PDF)
- The Crúbadán Project: Corpus building for under-resourced languages, Cahiers du Cental 4 (2007), pp5-15, C. Fairon, H. Naets, A. Kilgarriff, G-M de Schryver, eds., "Building and Exploring Web Corpora", Proceedings of the 3rd Web as Corpus Workshop in Louvain-la-Neuve, Belgium, September 2007. (PDF)
- Translations of free software into Irish (with Séamus Ó Ciardhuáin), Translation Ireland 17 (2007), no. 2, 19-30. Special issue on "Translation and Irish in the 21st Century". (PDF)
- Implementing NLP Projects for Non-Central Languages: Instructions for Funding Bodies, Strategies for Developers (with Oliver Streiter and Mathias Stuflesser), Machine Translation 20 (2006), no. 4, 267-289. (PDF)
- Machine translation for closely related language pairs, Proceedings of the Workshop "Strategies for developing machine translation for minority languages" at LREC 2006, Genoa, Italy, May 2006, pp103-107. (PDF)
- I was on the program committee for the workshop TAL et langues peu dotées at TALN 2005 in Dourdan, France, June 6-10, 2005.
- I was also on the program committee for a similar workshop that was held at the ACL meeting in Ann Arbor, June 29-30, 2005: Building and using parallel texts for languages with scarce resources.
- Applications of parallel corpora to the development of monolingual language technologies. (PDF)
- Automatic thesaurus generation for minority languages: an Irish example, Actes de la 10e conférence TALN à Batz-sur-Mer du 11 au 14 Juin 2003, volume 2, pp203-212. Paper presented at the workshop Traitement Automatique des Langues Minoritaires et des Petites Langues. (PDF)
- Hyphenation patterns for minority languages, TUGboat 24 (2003), no. 2, 236-239. (PDF)
- Global Software, lecture on internationalization for undergraduates at Saint Louis University, April 8th, 2002. (PDF Slides)
Software Projects
I have written and actively maintain a number of open source software packages in support of minority languages and other languages with limited computational resources.
Corpora, web-crawling, search engines
- aimsigh.com. Linguistically sophisticated search.
- An Crúbadán. A web crawler for building minority language corpora automatically.
- Corpas Comhthreomhar Gaeilge-Béarla. An aligned parallel corpus of Irish and English texts.
- Internet Corpus of Welsh. Contains approximately 100 million words of Welsh. Now in use by the University of Wales Welsh Dictionary.
- Other Corpora. Asturian, Aymara, Basque, Breton, ... Venda, Walloon, Zulu.
Spell Checking and Grammar Checking
- An Gramadóir. An open source grammar checking engine that works with vim, emacs, and OpenOffice.
- GaelSpell. Irish spellcheckers for multiple platforms built from a single, high-quality database.
- Aspell. I'm using web crawling and statistical methods to develop new spell checking packages for a number of minority languages.
Lexicography
- Líonra Séimeantach na Gaeilge. An Irish language semantic network ("WordNet"), available as a traditional thesaurus, or via a cool 3D browser.
- English-Irish-Afrikaans dictionary. Written with Darrin Speegle.
- Hyphenation. An Irish hyphenation dictionary adapted for use with TeX/LaTeX, Scribus, OpenOffice, etc.
Machine Translation
- ga2gd. Robust machine translation between closely related languages.
Human Translation
- GNU/Linux. Ever wonder how to say "in compatibility mode, the last two arguments must be offsets" in Irish? I am team leader at the GNU Translation Project.
- OpenOffice.org. I'm also coordinating the effort to translate OpenOffice.org into Irish.
- Mozilla. Localization of the Firefox web browser, Thunderbird email client, and Sunbird calendar into Irish.
- KDE. Joint work with Séamus Ó Ciardhuáin.