Natural Language Processing
If you prefer, I also have an Irish-language page.
Lectures and Papers
- Linguistic resources from web corpora (a.k.a. "The worst archive ever"), 2nd INNET conference on digital language archiving, Gniezno, Poland, 7 September 2013 (Slides)
- Endangered languages and social media, workshop at INNET Summer School on Technological Approaches to the Documentation of Lesser-Used Languages, Gniezno, Poland, 5 September 2013 (Slides)
- Unicode for Linguists, workshop at INNET Summer School on Technological Approaches to the Documentation of Lesser-Used Languages, Gniezno, Poland, 4 September 2013 (Slides, Tutorial)
- How many languages are on the web? The Crúbadán project 10+ years on, invited talk at the Workshop on Corpus-based Quantitative Typology (CoQuaT 2013) in Leipzig, 14 August 2013 (Slides)
- Tweet2Learn: Language Learning via Social Media, paper presented at the 19th annual NAACLT conference, University of Ottawa, 30 May 2013. Paper given in Irish with simultaneous translation into English by Eoin Ó Catháin (Slides, Paper)
- Localization at Mozilla, overview of processes and tools for localization at Mozilla, University of Limerick, 21 May 2013. Presented jointly with Jeff Beatty, l10n Program Manager at Mozilla. (Slides)
- Localization in minority language contexts, presented at the University of Limerick L10N summer school, 21 May 2013. (Slides)
- Facebook i18n Workshop, hands-on session covering localization issues that arise for under-resourced languages, University of Limerick, 21 May 2013. (Slides, Example Strings)
- Machine Learning in Natural Language Processing, lecture at the St. Louis Machine Learning and Data Science meetup, 15 May 2013. (Slides)
- Regional and Minority Languages in Social Media, lecture at the Illinois High School Translation Competition, University of Illinois EU Center, 2 May 2013.
- Indigenous Tweets, Visible Voices, and Technology, panel discussion at SXSW '13, Austin, Texas, with Kara Andrade, Maite Goñi, and Peter Rohloff, 9 March 2013. (Paper, Live Tweets, Slides)
- Teangacha Mionlaigh sa Ré Dhigiteach: Tionchar na Meán Sóisialta, presented at the seminar "Ar Scáth a Chéile a Mhairimid", Baile Átha Cliath, 2 March 2013. (Slides)
- Mionteangacha ar an Idirlíon: ag sárú na gconstaicí, video lecture at Féile Imbolc, Baile Bhúirne, Contae Chorcaí, 16 February 2013. (Video, Slides)
- Social media and revitalization of the Celtic languages, paper presented at the Symposium for the 20th anniversary of the European Charter for Regional or Minority Languages, 5 November 2012, University of Illinois at Urbana-Champaign. (Paper)
- An Ríomhlóid Abú!, panel discussion with Ciarán Ó Bréartúin and Séamus Ó Briain at Oireachtas na Samhna, Leitir Ceanainn, Dún na nGall, 2 November 2012. (Slides 1, Slides 2)
- I dTreo Ríomhaireachta Lán-Ghaeilge, public lecture at Engineers Ireland, Baile Átha Cliath, 31 October 2012. (Slides)
- Translating Facebook into Endangered Languages, in Language Endangerment in the 21st Century: Globalisation, Technology and New Media. Proceedings of the 16th Foundation for Endangered Languages Conference, Auckland, Aotearoa/New Zealand, 12-15 September 2012, pp. 106-110. (Paper, Slides, News Coverage)
- The Irish Language in the Digital Age / An Ghaeilge sa Ré Dhigiteach (with John Judge, Ailbhe Ní Chasaide, Rose Ní Dhubhda, Elaine Uí Dhonnchadha), META-NET White Paper Series, Berlin: Springer-Verlag, 2012 (PDF)
- Facebook in all six Celtic languages (mar dhea), paper presented at the 18th annual NAACLT conference, Indiana University, May 2012. (PDF)
- Fíor-Ghaeltacht nó Gaeltacht Fhíorúil? Meáin Shóisialta agus Forbairt na Gaeilge i Meiriceá Thuaidh, paper given at the "Fiche Bliain" conference at the University of Ottawa, 28 October 2011. (Paper, Slides, Video (at 32:23))
- New computational resources for indigenous and minority languages, paper presented at the 17th annual NAACLT conference, Isle of Man, 13 May 2011. (PDF)
- Saving Languages with Statistics, survey talk on statistical NLP and endangered languages, University of Oregon, 14 January 2011. (PDF)
- Accentuate Us!, talk at Saint Louis University with Michael Schade, 10 November 2010. (PDF, Video)
- Turning the Tide: Terminology Creation and Free Software in Irish, invited talk at the first annual Indigenous Language Institute Symposium Series, Pueblo Isleta, New Mexico, 11 October 2010. (PDF)
- Statistical Unicodification of African Languages, Language Resources and Evaluation 45 (2011), no. 3, 375-386. (PDF)
- Saving languages with statistics and web crawlers, talk on diacritic restoration ("charlifter") at St. Louis BarCamp, Washington University, 7-8 November 2009. (PDF)
- Vox Humanitatis podcast on machine translation for African languages, with Bèrto 'd Sèra and Martin Benjamin, 5 November 2009. (Listen now)
- Standardization of corpus texts for the New English-Irish Dictionary, paper presented at the 15th annual NAACLT conference, New York, 22 May 2009. (PDF)
- Statistical language processing (and Perl), survey talk for the St. Louis Perl Mongers, 21 January 2009. (PDF)
- Free software for indigenous languages, to appear in Native Language Network, a publication of the Indigenous Language Institute. (PDF)
- Semi-automated construction of semantic networks using web corpora, paper presented at the "Words, Texts and Dictionaries" conference, University of Wales Centre for Advanced Welsh and Celtic Studies, Aberystwyth, 18 October 2008. (PDF)
- Open source language technology using statistical methods, survey talk for the St. Louis Unix Users Group, 21 August 2008. (PDF)
- An Gramadóir: A grammar-checking framework for the Celtic languages and its applications, paper presented at the 14th annual NAACLT conference, Madog Center for Welsh Studies, University of Rio Grande, 12-13 June 2008. (PDF)
- Language technology from scratch. Video lecture at the Vox Humanitatis conference on technology for minority languages held as part of the 41st Festival of the Piedmont Region, Cherasco, Italy, May 2008. (PDF Transcript)
- I was the featured speaker at the 3rd Web as Corpus Workshop (WAC3) in Louvain-la-Neuve, Belgium, 15-16 September 2007. (Main talk: PDF. Panel discussion: PDF)
- The Crúbadán Project: Corpus building for under-resourced languages, Cahiers du Cental 4 (2007), pp. 5-15, C. Fairon, H. Naets, A. Kilgarriff, G-M de Schryver, eds., "Building and Exploring Web Corpora", Proceedings of the 3rd Web as Corpus Workshop in Louvain-la-Neuve, Belgium, September 2007. (PDF)
- Translations of free software into Irish (with Séamus Ó Ciardhuáin), Translation Ireland 17 (2007), no. 2, 19-30. Special issue on "Translation and Irish in the 21st Century". (PDF)
- Implementing NLP Projects for Non-Central Languages: Instructions for Funding Bodies, Strategies for Developers (with Oliver Streiter and Mathias Stuflesser), Machine Translation 20 (2006), no. 4, 267-289. (PDF: published version, 28 pages, PDF: abridged version, 12 pages)
- Machine translation for closely related language pairs, Proceedings of the Workshop "Strategies for developing machine translation for minority languages" at LREC 2006, Genoa, Italy, May 2006, pp. 103-107. (PDF)
- Applications of parallel corpora to the development of monolingual language technologies. (PDF)
- Automatic thesaurus generation for minority languages: an Irish example, Actes de la 10e conférence TALN à Batz-sur-Mer du 11 au 14 Juin 2003, volume 2, pp. 203-212. Paper presented at the workshop Traitement Automatique des Langues Minoritaires et des Petites Langues. (PDF)
- Hyphenation patterns for minority languages, TUGboat 24 (2003), no. 2, 236-239. (PDF)
- Global Software, lecture on internationalization for undergraduates at Saint Louis University, 8 April 2002. (PDF Slides)
- Under-resourced Languages track at COLING 2012, Mumbai, India, December 2012.
- 7th Celtic Linguistics Conference, Rennes, France, June 2012.
- Workshop on Free/Open-Source Rule-Based Machine Translation, FreeRBMT '12 in Gothenburg, Sweden, 13-15 June 2012.
- Joint SALTMIL/AfLAT workshop at LREC 2012, Istanbul, Turkey, May 2012.
- 18th annual NAACLT conference, Indiana University, May 2012.
- Irish language conference "An fiche bliain atá romhainn: taighde agus teagasc na Gaeilge i Meiriceá Thuaidh" held at the University of Ottawa, October 2011.
- Workshop "Algorithms and resources for modelling dialects and language varieties" at EMNLP 2011, Dún Éideann, July 2011.
- 17th annual NAACLT conference, Ellan Vannin, May 12-13, 2011.
- 2nd International Workshop on Free/Open-Source Rule-Based Machine Translation FreeRBMT '11, Universitat Politècnica de Catalunya, January 2011.
- 16th annual NAACLT conference, Sabhal Mòr Ostaig, Isle of Skye, June 9-12, 2010.
- 6th Web as Corpus workshop (WAC6) at NAACL-HLT 2010 in Los Angeles, June 5th, 2010.
- Morphology and phonology program at NAACL-HLT 2010 in Los Angeles, June 1-6, 2010.
- SALTMIL workshop "Creation and use of basic lexical resources for less-resourced languages" at LREC 2010, Malta, May 23, 2010.
- 2nd AfLaT workshop on African language technology at LREC 2010, Malta, May 18, 2010.
- 1st International Workshop on Free/Open-Source Rule-based Machine Translation, Alacant, Spain, November 2-3, 2009.
- 5th Web as Corpus workshop (WAC5) in San Sebastián, Spain, September 7th, 2009.
- 4th Web as Corpus workshop (WAC4) in Marrakech, Morocco, June 1st, 2008.
- 3rd Web as Corpus workshop (WAC3) in Louvain-la-Neuve, Belgium, September 15-16, 2007.
- TAL et langues peu dotées at TALN 2005 in Dourdan, France, June 6-10, 2005.
- Building and using parallel texts for languages with scarce resources at the ACL meeting in Ann Arbor, June 29-30, 2005.
I have written and actively maintain a number of open source software packages in support of minority languages and other languages with limited computational resources.
Corpora, web-crawling, search engines
- Indigenous Tweets. A site that crawls Twitter and displays everyone tweeting in an indigenous or minority language.
- Indigenous Blogs. Directory of all blogs in indigenous or minority languages, with per-language RSS feeds.
- An Crúbadán. A web crawler for building minority language corpora automatically.
- Orthotree. Code and data for generating a phylogenetic tree of the world's languages. Read the blog post.
- Corpas Comhthreomhar Gaeilge-Béarla. An aligned parallel corpus of Irish and English texts.
- Internet Corpus of Welsh. Contains approximately 100 million words of Welsh. Now in use by the University of Wales Welsh Dictionary.
- Other Corpora. Asturian, Aymara, Basque, Breton, ... Venda, Walloon, Zulu.
Spell Checking and Grammar Checking
- An Gramadóir. An open source grammar checking engine that works with vim, emacs, and OpenOffice.
- accentuate.us (was "charlifter"). A web service and Firefox add-on that performs statistical diacritic restoration for more than 100 languages (Irish, Lingala, Hawaiian, Samoan, ...). Joint work with my student Michael Schade. You can also try it on the web thanks to my friends in Haiti at Logipam.
- Hunspell for Bantu languages. Set of Perl scripts for morphological generation of verbs in Bantu languages, implemented here for Kinyarwanda; can be used to generate hunspell affix files automatically.
- GaelSpell. Irish spellcheckers for multiple platforms built from a single, high-quality database.
- Aspell. I'm using web crawling and statistical methods to develop new spell checking packages for a number of minority languages.
- Adaptxt. Predictive text software for Irish, Scottish and Manx Gaelic on Android phones. With Michael Bauer.
- Foclóir Nua Béarla-Gaeilge (The New English-Irish Dictionary). I am helping prepare Irish texts written in pre-standard orthography for indexing and inclusion in the project corpus.
- Foclóir na Nua-Ghaeilge (Historical Dictionary of Modern Irish). Normalization and indexing of corpus texts.
- L. S. Gogan's Irish Dictionary. Similar standardization work for this incredible manuscript dictionary.
- Líonra Séimeantach na Gaeilge. An Irish language semantic network ("WordNet").
- English-Irish-Afrikaans dictionary. Written with Darrin Speegle.
- ga2gd. Robust machine translation between closely-related languages. See student projects below.
- en2ga. Work in progress on syntax-aware English-to-Irish machine translation.
- Scrabble3D in Irish. Irish version of a customizable online Scrabble game.
- OCR for Gaelic fonts (seanchló). Training models for the open source OCR engine "tesseract".
- Hyphenation. Irish hyphenation patterns adapted for use with TeX/LaTeX, Scribus, OpenOffice, etc.
- Secwepemc-Facebook. Software for translating Facebook into any language. Based on work of Neskie Manuel.
- Skype in Your Language. I'm helping Michael Bauer with his project to enable localization of Skype into any language, using Transifex.
- GNU/Linux. Ever wonder how to say "in compatibility mode, the last two arguments must be offsets" in Irish? I am team leader at the GNU Translation Project.
- OpenOffice.org. I'm also coordinating the effort to translate OpenOffice.org into Irish. Here's an article about the launch that appeared in Lá Nua (on page 3).
- LibreOffice. The heir apparent to OpenOffice.org.
- Mozilla. Localization of the Firefox web browser, Thunderbird email handler, and Lightning calendar into Irish.
- KDE. Joint work with Séamus Ó Ciardhuáin.
- Much much more...
Selected Student Projects
- IPA Scrabble. NLP project by Steve Neuenhan.
- LibreOffice client for accentuate.us. Senior capstone project by Jason Lim.
- Gender identification on Twitter. 8th grade science fair project by Madeleine Scannell, 2012.
- Alarum For London. 7th grade science fair project on author identification by Madeleine Scannell, 2011.
- Barr do Theanga. Master's thesis by Liz Warren on Líonra Séimeantach na Gaeilge, at University College, Galway.
- Would you like a pop with that hoagie? 3rd grade science fair project on dialect leveling by my son Kevin Scannell, Jr., 2010.
- Meta-data for web crawling. Export of the Crúbadán meta-data to XML-RDF format, with Edward Jahn from George Mason University.
- Port of ga2gd to Apertium. Project by Sean Burke (University of Montana undergraduate) for the Google Summer of Code, 2009.
- Amharic-English Cross-Lingual Information Retrieval: A Corpus-Based Approach. By Aynalem Tesfaye, as part of her MS at Addis Ababa University, Department of Information Science, 2009. (PDF)
- Language recognition and corpus building for Bosnian, Serbian, and Croatian. With Eldar Murselovic, SLU undergraduate.
- Bayesian grammar checker for Irish. Senior design project by Rich Barmeier.
- Example-based machine translation of open-source software packages. Senior design project by Regina Lennox, 2009.
- Who's afraid of the big bad Whorf? 4th grade science fair project on linguistic relativity by my daughter Maddy, 2009.
- Port of ga2gd to Apertium, and development of Irish-Manx lexicon (ga2gv). Project by Joshua Glatt, Washington University undergraduate, 2008-2009.
- Translation recognition algorithm. Senior design project by Michael Henderson, 2008.
- Kinyarwanda morphology. Developed by Jackson Muhirwe as part of his Ph.D. work at Makerere University in Kampala, Uganda, 2007.
- Bosnian word-sense disambiguation. Senior design project by Jasmin Custic, 2007.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.