Natural Language Processing
If you prefer, I also have an Irish-language version of this page.
Lectures and Papers
- Fíor-Ghaeltacht nó Gaeltacht Fhíorúil? Meáin Shóisialta agus Forbairt na Gaeilge i Meiriceá Thuaidh, paper given at the "Fiche Bliain" conference at the University of Ottawa, October 28, 2011. (PDF)
- New computational resources for indigenous and minority languages, paper presented at the 17th annual NAACLT conference, Isle of Man, May 13, 2011. (PDF)
- Saving Languages with Statistics, survey talk on statistical NLP and endangered languages, University of Oregon, January 14, 2011. (PDF)
- Accentuate Us!, talk at Saint Louis University with Michael Schade, November 10, 2010. (PDF, Video)
- Turning the Tide: Terminology Creation and Free Software in Irish, invited talk at the first annual Indigenous Language Institute Symposium Series, Pueblo Isleta, New Mexico, October 11, 2010. (PDF)
- Statistical Unicodification of African Languages, Language Resources and Evaluation 45 (2011), no. 3, 375-386. (PDF)
- Saving languages with statistics and web crawlers, talk on diacritic restoration ("charlifter") at St. Louis BarCamp, Washington University, November 7-8, 2009. (PDF)
- Vox Humanitatis podcast on machine translation for African languages, with Bèrto 'd Sèra and Martin Benjamin, November 5, 2009. (Listen now)
- Standardization of corpus texts for the New English-Irish Dictionary, paper presented at the 15th annual NAACLT conference, New York, May 22, 2009. (PDF)
- Statistical language processing (and Perl), survey talk for the St. Louis Perl Mongers, January 21, 2009. (PDF)
- Free software for indigenous languages, to appear in Native Language Network, a publication of the Indigenous Language Institute. (PDF)
- Semi-automated construction of semantic networks using web corpora, paper presented at the "Words, Texts and Dictionaries" conference, University of Wales Centre for Advanced Welsh and Celtic Studies, Aberystwyth, October 18, 2008. (PDF)
- Open source language technology using statistical methods, survey talk for the St. Louis Unix Users Group, August 21, 2008. (PDF)
- An Gramadóir: A grammar-checking framework for the Celtic languages and its applications, paper presented at the 14th annual NAACLT conference, Madog Center for Welsh Studies, University of Rio Grande, June 12-13, 2008. (PDF)
- Language technology from scratch. Video lecture at the Vox Humanitatis conference on technology for minority languages held as part of the 41st Festival of the Piedmont Region, Cherasco, Italy, May 2008. (PDF Transcript)
- I was the featured speaker at the 3rd Web as Corpus Workshop (WAC3) in Louvain-la-Neuve, Belgium, September 15-16, 2007. (Main talk: PDF. Panel discussion: PDF)
- The Crúbadán Project: Corpus building for under-resourced languages, Cahiers du Cental 4 (2007), pp. 5-15, C. Fairon, H. Naets, A. Kilgarriff, G-M de Schryver, eds., "Building and Exploring Web Corpora", Proceedings of the 3rd Web as Corpus Workshop in Louvain-la-Neuve, Belgium, September 2007. (PDF)
- Translations of free software into Irish (with Séamus Ó Ciardhuáin), Translation Ireland 17 (2007), no. 2, 19-30. Special issue on "Translation and Irish in the 21st Century". (PDF)
- Implementing NLP Projects for Non-Central Languages: Instructions for Funding Bodies, Strategies for Developers (with Oliver Streiter and Mathias Stuflesser), Machine Translation 20 (2006), no. 4, 267-289. (PDF: published version, 28 pages, PDF: abridged version, 12 pages)
- Machine translation for closely related language pairs, Proceedings of the Workshop "Strategies for developing machine translation for minority languages" at LREC 2006, Genoa, Italy, May 2006, pp. 103-107. (PDF)
- Applications of parallel corpora to the development of monolingual language technologies. (PDF)
- Automatic thesaurus generation for minority languages: an Irish example, Actes de la 10e conférence TALN à Batz-sur-Mer du 11 au 14 Juin 2003, volume 2, pp. 203-212. Paper presented at the workshop Traitement Automatique des Langues Minoritaires et des Petites Langues. (PDF)
- Hyphenation patterns for minority languages, TUGboat 24 (2003), no. 2, 236-239. (PDF)
- Global Software, lecture on internationalization for undergraduates at Saint Louis University, April 8th, 2002. (PDF Slides)
Program Committees
- Workshop on Free/Open-Source Rule-Based Machine Translation, FreeRBMT '12 in Gothenburg, Sweden, 13-15 June 2012.
- Joint SALTMIL/AfLAT workshop at LREC 2012, Istanbul, Turkey, May 2012.
- Irish language conference "An fiche bliain atá romhainn: taighde agus teagasc na Gaeilge i Meiriceá Thuaidh" held at the University of Ottawa, October 2011.
- Workshop "Algorithms and resources for modelling dialects and language varieties" at EMNLP 2011, Dún Éideann, July 2011.
- 17th annual NAACLT conference, Ellan Vannin, May 12-13, 2011.
- 2nd International Workshop on Free/Open-Source Rule-Based Machine Translation FreeRBMT '11, Universitat Politècnica de Catalunya, January 2011.
- 16th annual NAACLT conference, Sabhal Mòr Ostaig, Isle of Skye, June 9-12, 2010.
- 6th Web as Corpus workshop (WAC6) at NAACL-HLT 2010 in Los Angeles, June 5th, 2010.
- Morphology and phonology program at NAACL-HLT 2010 in Los Angeles, June 1-6, 2010.
- SALTMIL workshop "Creation and use of basic lexical resources for less-resourced languages" at LREC 2010, Malta, May 23, 2010.
- 2nd AfLaT workshop on African language technology at LREC 2010, Malta, May 18, 2010.
- 1st International Workshop on Free/Open-Source Rule-based Machine Translation, Alacant, Spain, November 2-3, 2009.
- 5th Web as Corpus workshop (WAC5) in San Sebastián, Spain, September 7th, 2009.
- 4th Web as Corpus workshop (WAC4) in Marrakech, Morocco, June 1st, 2008.
- 3rd Web as Corpus workshop (WAC3) in Louvain-la-Neuve, Belgium, September 15-16, 2007.
- TAL et langues peu dotées at TALN 2005 in Dourdan, France, June 6-10, 2005.
- Building and using parallel texts for languages with scarce resources at the ACL meeting in Ann Arbor, June 29-30, 2005.
Software Projects
I have written and actively maintain a number of open source software packages in support of minority languages and other languages with limited computational resources.
Corpora, web-crawling, search engines
- Indigenous Tweets. A site that crawls Twitter and displays everyone tweeting in an indigenous or minority language.
- Indigenous Blogs. Directory of all blogs in indigenous or minority languages, with per-language RSS feeds.
- An Crúbadán. A web crawler for building minority language corpora automatically.
- Corpas Comhthreomhar Gaeilge-Béarla. An aligned parallel corpus of Irish and English texts.
- Internet Corpus of Welsh. Contains approximately 100 million words of Welsh. Now in use by the University of Wales Welsh Dictionary.
- Other Corpora. Asturian, Aymara, Basque, Breton, ... Venda, Walloon, Zulu.
Spell Checking and Grammar Checking
- An Gramadóir. An open source grammar checking engine that works with vim, emacs, and OpenOffice.
- accentuate.us (was "charlifter"). A web service and Firefox add-on that performs statistical diacritic restoration for more than 100 languages (Irish, Lingala, Hawaiian, Samoan, ...). Joint work with my student Michael Schade. You can also try it on the web thanks to my friends in Haiti at Logipam.
- Hunspell for Bantu languages. Set of Perl scripts for morphological generation of verbs in Bantu languages, implemented here for Kinyarwanda; can be used to generate hunspell affix files automatically.
- GaelSpell. Irish spellcheckers for multiple platforms built from a single, high-quality database.
- Aspell. I'm using web crawling and statistical methods to develop new spell checking packages for a number of minority languages.
Lexicography
- Foclóir Nua Béarla-Gaeilge (The New English-Irish Dictionary). I am helping prepare Irish texts written in pre-standard orthography for indexing and inclusion in the project corpus.
- Líonra Séimeantach na Gaeilge. An Irish language semantic network ("WordNet"), available as a traditional thesaurus, or via a cool 3D browser.
- English-Irish-Afrikaans dictionary. Written with Darrin Speegle.
Machine Translation
- ga2gd. Robust machine translation between closely related languages. See student projects below.
- en2ga. Work in progress on syntax-aware English-to-Irish machine translation.
Varia
- Scrabble3D in Irish. Irish version of a customizable online Scrabble game.
- OCR for Gaelic fonts (seanchló). Training models for the open source OCR engine "tesseract".
- Hyphenation. Irish hyphenation patterns adapted for use with TeX/LaTeX, Scribus, OpenOffice, etc.
Human Translation
- GNU/Linux. Ever wonder how to say "in compatibility mode, the last two arguments must be offsets" in Irish? I am team leader at the GNU Translation Project.
- OpenOffice.org. I'm also coordinating the effort to translate OpenOffice.org into Irish. Here's an article about the launch that appeared in Lá Nua (on page 3).
- LibreOffice. The heir apparent to OpenOffice.org.
- Mozilla. Localization of the Firefox web browser, Thunderbird email client, and Sunbird calendar into Irish.
- KDE. Joint work with Séamus Ó Ciardhuáin.
- Much, much more...
Selected Student Projects
- Alarum For London. 7th grade science fair project on author identification by Madeleine Scannell, 2011.
- Barr do Theanga. Master's thesis by Liz Warren on Líonra Séimeantach na Gaeilge, at University College, Galway.
- Would you like a pop with that hoagie? 3rd grade science fair project on dialect leveling by my son Kevin Scannell, Jr., 2010.
- Meta-data for web crawling. Export of the Crúbadán meta-data to XML-RDF format, with Edward Jahn from George Mason University.
- Port of ga2gd to Apertium. Project by Sean Burke (University of Montana undergraduate) for the Google Summer of Code, 2009.
- Amharic-English Cross-Lingual Information Retrieval: A Corpus-Based Approach. By Aynalem Tesfaye, as part of her MS at Addis Ababa University, Department of Information Science, 2009. (PDF)
- Language recognition and corpus building for Bosnian, Serbian, and Croatian. With Eldar Murselovic, SLU undergraduate.
- Bayesian grammar checker for Irish. Senior design project by Rich Barmeier.
- Example-based machine translation of open-source software packages. Senior design project by Regina Lennox, 2009.
- Who's afraid of the big bad Whorf? 4th grade science fair project on linguistic relativity by my daughter Maddy, 2009.
- Port of ga2gd to Apertium, and development of Irish-Manx lexicon (ga2gv). Project by Joshua Glatt, Washington University undergraduate, 2008-2009.
- Translation recognition algorithm. Senior design project by Michael Henderson, 2008.
- Kinyarwanda morphology. Developed by Jackson Muhirwe as part of his Ph.D. work at Makerere University in Kampala, Uganda, 2007.
- Bosnian word-sense disambiguation. Senior design project by Jasmin Custic, 2007.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.