Natural Language Processing
Lectures, Papers, and Conferences
- Statistical Unicodification of African Languages, submitted for publication. (PDF)
- Saving languages with statistics and web crawlers, talk on diacritic restoration ("charlifter") at St. Louis BarCamp, Washington University, November 7-8, 2009. (PDF)
- Vox Humanitatis podcast on machine translation for African languages, with Bèrto 'd Sèra and Martin Benjamin, November 5, 2009. (Listen now)
- Standardization of corpus texts for the New English-Irish Dictionary, paper presented at the 15th annual NAACLT conference, New York, May 22, 2009. (PDF)
- Statistical language processing (and Perl), survey talk for the St. Louis Perl Mongers, January 21, 2009. (PDF)
- I am on the program committee for the 5th Web as Corpus workshop in San Sebastián, Spain, September 7th, 2009.
- Free software for indigenous languages, to appear in Native Language Network, a publication of the Indigenous Language Institute. (PDF)
- Semi-automated construction of semantic networks using web corpora, paper presented at the "Words, Texts and Dictionaries" conference, University of Wales Centre for Advanced Welsh and Celtic Studies, Aberystwyth, October 18, 2008. (PDF)
- Open source language technology using statistical methods, survey talk for the St. Louis Unix Users Group, August 21, 2008. (PDF)
- An Gramadóir: A grammar-checking framework for the Celtic languages and its applications, paper presented at the 14th annual NAACLT conference, Madog Center for Welsh Studies, University of Rio Grande, June 12-13, 2008. (PDF)
- I am on the program committee for the 4th Web as Corpus workshop in Marrakech, Morocco, June 1st, 2008.
- Language technology from scratch. Video lecture at the Vox Humanitatis conference on technology for minority languages held as part of the 41st Festival of the Piedmont Region, Cherasco, Italy, May 2008. (PDF Transcript)
- I was an invited speaker and served on the program committee for the 3rd Web as Corpus Workshop (WAC3) in Louvain-la-Neuve, Belgium, September 15-16, 2007. (Main talk: PDF. Panel discussion: PDF)
- The Crúbadán Project: Corpus building for under-resourced languages, Cahiers du Cental 4 (2007), pp. 5-15, C. Fairon, H. Naets, A. Kilgarriff, G-M de Schryver, eds., "Building and Exploring Web Corpora", Proceedings of the 3rd Web as Corpus Workshop in Louvain-la-Neuve, Belgium, September 2007. (PDF)
- Translations of free software into Irish (with Séamus Ó Ciardhuáin), Translation Ireland 17 (2007), no. 2, 19-30. Special issue on "Translation and Irish in the 21st Century". (PDF)
- Implementing NLP Projects for Non-Central Languages: Instructions for Funding Bodies, Strategies for Developers (with Oliver Streiter and Mathias Stuflesser), Machine Translation 20 (2006), no. 4, 267-289. (PDF: published version, 28 pages; PDF: abridged version, 12 pages)
- Machine translation for closely related language pairs, Proceedings of the Workshop "Strategies for developing machine translation for minority languages" at LREC 2006, Genoa, Italy, May 2006, pp. 103-107. (PDF)
- I was on the program committee for the workshop TAL et langues peu dotées at TALN 2005 in Dourdan, France, June 6-10, 2005.
- I was also on the program committee for a similar workshop that was held at the ACL meeting in Ann Arbor, June 29-30, 2005: Building and using parallel texts for languages with scarce resources.
- Applications of parallel corpora to the development of monolingual language technologies. (PDF)
- Automatic thesaurus generation for minority languages: an Irish example, Actes de la 10e conférence TALN à Batz-sur-Mer du 11 au 14 Juin 2003, volume 2, pp. 203-212. Paper presented at the workshop Traitement Automatique des Langues Minoritaires et des Petites Langues. (PDF)
- Hyphenation patterns for minority languages, TUGboat 24 (2003), no. 2, 236-239. (PDF)
- Global Software, lecture on internationalization for undergraduates at Saint Louis University, April 8th, 2002. (PDF Slides)
Software Projects
I have written and actively maintain a number of open source software packages in support of minority languages and other languages with limited computational resources.
Corpora, web-crawling, search engines
- aimsigh.com. Linguistically sophisticated search.
- An Crúbadán. A web crawler for building minority language corpora automatically.
- Corpas Comhthreomhar Gaeilge-Béarla. An aligned parallel corpus of Irish and English texts.
- Internet Corpus of Welsh. Contains approximately 100 million words of Welsh. Now in use by the University of Wales Welsh Dictionary.
- Other Corpora. Asturian, Aymara, Basque, Breton, ... Venda, Walloon, Zulu.
Spell Checking and Grammar Checking
- An Gramadóir. An open source grammar checking engine that works with vim, emacs, and OpenOffice.
- charlifter. A script that performs statistical diacritic restoration. Pre-trained models are available for more than 100 languages (Irish, Lingala, Hawaiian, Samoan, ...)
- Hunspell for Bantu languages. A set of Perl scripts for morphological generation of verbs in Bantu languages, implemented here for Kinyarwanda; they can be used to generate hunspell affix files automatically.
- GaelSpell. Irish spellcheckers for multiple platforms built from a single, high-quality database.
- Aspell. I'm using web crawling and statistical methods to develop new spell checking packages for a number of minority languages.
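As a rough illustration of what statistical diacritic restoration (the task charlifter addresses) involves, here is a minimal Python sketch. It is not charlifter's actual algorithm: it is a simple word-frequency baseline, and the function names (`strip_diacritics`, `train`, `restore`) and the tiny training text are hypothetical, chosen only for the example.

```python
from collections import Counter, defaultdict
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with combining marks removed (e.g. 'á' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def train(corpus_words):
    """Count how often each accented form occurs for each stripped key."""
    model = defaultdict(Counter)
    for word in corpus_words:
        model[strip_diacritics(word)][word] += 1
    return model


def restore(text_words, model):
    """Replace each word with the most frequent accented form seen in training."""
    restored = []
    for word in text_words:
        candidates = model.get(strip_diacritics(word))
        # Unknown words are passed through unchanged.
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return restored


# Toy training text (an assumption for illustration; a real model needs a large corpus).
training = "tá an fear sa bhád agus tá an bád mór".split()
model = train(training)
print(" ".join(restore("ta an fear sa bhad".split(), model)))
```

A baseline like this ignores context and letter case, so it cannot choose between two valid accented forms of the same stripped word; a practical system would need context-sensitive statistics and a fallback for unseen words.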
Lexicography
- Foclóir Nua Béarla-Gaeilge (The New English-Irish Dictionary). I am helping prepare Irish texts written in pre-standard orthography for indexing and inclusion in the project corpus.
- Líonra Séimeantach na Gaeilge. An Irish language semantic network ("WordNet"), available as a traditional thesaurus, or via a cool 3D browser.
- English-Irish-Afrikaans dictionary. Written with Darrin Speegle.
- Hyphenation. An Irish hyphenation dictionary adapted for use with TeX/LaTeX, Scribus, OpenOffice, etc.
Machine Translation
- ga2gd. Robust machine translation between closely related languages. See student projects below.
- en2ga. Work in progress on English-to-Irish machine translation. Aiming for a first release in summer 2012.
Human Translation
- GNU/Linux. Ever wonder how to say "in compatibility mode, the last two arguments must be offsets" in Irish? I am team leader at the GNU Translation Project.
- OpenOffice.org. I'm also coordinating the effort to translate OpenOffice.org into Irish. Here's an article about the launch that appeared in Lá Nua (on page 3).
- Mozilla. Localization of the Firefox web browser, Thunderbird email client, and Sunbird calendar into Irish.
- KDE. Joint work with Séamus Ó Ciardhuáin.
Selected Student Projects
- Would you like a pop with that hoagie? 3rd grade science fair project on dialect leveling by my son Kevin Scannell, Jr., 2010.
- Meta-data for web crawling. Export of the Crúbadán meta-data to XML-RDF format, with Edward Jahn from George Mason University.
- Port of ga2gd to Apertium. Project by Sean Burke (University of Montana undergraduate) for the Google Summer of Code, 2009.
- Amharic-English Cross-Lingual Information Retrieval: A Corpus-Based Approach. By Aynalem Tesfaye, as part of her MS at Addis Ababa University, Department of Information Science, 2009.
- Language recognition and corpus building for Bosnian, Serbian, and Croatian. With Eldar Murselovic, SLU undergraduate.
- Bayesian grammar checker for Irish. Senior design project by Rich Barmeier.
- Example-based machine translation of open-source software packages. Senior design project by Regina Lennox, 2009.
- Who's afraid of the big bad Whorf? 4th grade science fair project on linguistic relativity by my daughter Maddy, 2009.
- Port of ga2gd to Apertium, and development of Irish-Manx lexicon (ga2gv). Project by Joshua Glatt, Washington University undergraduate, 2008-2009.
- Translation recognition algorithm. Senior design project by Michael Henderson, 2008.
- Kinyarwanda morphology. Developed by Jackson Muhirwe as part of his Ph.D. work at Makerere University in Kampala, Uganda, 2007.
- Bosnian word-sense disambiguation. Senior design project by Jasmin Custic, 2007.