GaelSpell Unix:
English summary
Kevin P. Scannell
Summary
This page is provided as an aid to package maintainers who are unable to read the GaelSpell Unix home page, which is entirely in Irish. It is in no sense a translation of the Irish page (which contains much more detailed descriptions).
Three packages are available from this site, providing Irish language support for the most widely used spellcheckers in the Open Source community: ispell-gaeilge for Geoff Kuenning's International Ispell, aspell-gaeilge for Kevin Atkinson's Aspell, and hunspell-gaeilge for the OpenOffice.org spellchecker. The word lists are identical; only the packaging differs.
Diarmaid Mac Mathúna has also repackaged the same underlying word list for use on Windows machines; it is available (under the GPL) from the primary GaelSpell site: www.gaelspell.com.
Features
- Large Word List. There are around 300,000 words in the database; by my estimate this is about five times larger than the new Irish spellchecker released by Microsoft (I can't tell for sure because it's closed-source). The coverage is equivalent to a dictionary with around 26,000 headwords, almost twice as big as a typical pocket dictionary (e.g. the Oxford or the Collins Gem).
- Grammatical Completeness. I have written software which generates every inflected form of a dictionary headword when provided with a limited amount of grammatical information. For instance, by adding the word fuaimnigh to the underlying database as a second conjugation verb, 87 inflected forms are added to the word list (all verb endings plus lenition, eclipsis, prefix "d'", etc.).
- Accuracy. The only absolute rule when generating a spellchecker is that there should be no misspelled words in the basic word lists. Every word has been checked against print sources at least once. The software which generates the inflected forms has been tested in various ways, including with the shell script "igcheck", which checks a word list for letter combinations that are illegal or "pre-standard" in Irish. The other word lists I've seen contain anywhere from 10% to 40% English or misspelled Irish words.
- Frequent Updates. I have provided major updates every six months or so since the initial release and plan to continue this for the foreseeable future. Candidates for addition to the word list are harvested via statistical methods using my web crawling software An Crúbadán; this is an effective way of keeping up with the latest terminology. I have also been adding words from the print dictionaries published by An Gúm and the resources available from acmhainn.ie.
- Dialect support (ispell only). There are three different installation options included with the ispell-gaeilge package, described below under Alternate Models.
- Phonetic support (aspell only). The file gaeilge_phonet.dat provides a complete "coarse" encoding of the pronunciation of Irish. This allows aspell to make more intelligent suggestions when it comes across a misspelled word. For instance, where ispell gives no suggestions for the pre-standard imfhiosach, aspell uses the phonetics file to encode this as "*M*S*K", thereby recognizing and suggesting the correct spelling iomasach.
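To make the idea of a "coarse" phonetic encoding concrete, here is a toy sketch in Python. It is not the rule set in gaeilge_phonet.dat, and it does not use aspell at all; the function name coarse_code and the three rules (silent "fh", "ch" written as K, any run of vowels collapsed to a single "*") are assumptions chosen only because they reproduce the one example given above.

```python
# Toy illustration only: these rules are assumptions, not the contents
# of gaeilge_phonet.dat.
VOWELS = set("aeiouáéíóú")

# Hypothetical digraph rules: "fh" is treated as silent, "ch" as K.
DIGRAPHS = {"fh": "", "ch": "K"}


def coarse_code(word):
    """Return a coarse phonetic key for a word under the toy rules above."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            if DIGRAPHS[pair]:
                out.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in VOWELS:
            # Collapse consecutive vowels into a single "*".
            if not out or out[-1] != "*":
                out.append("*")
            i += 1
        else:
            # Every other letter stands for itself, in upper case.
            out.append(word[i].upper())
            i += 1
    return "".join(out)


print(coarse_code("imfhiosach"))  # *M*S*K
print(coarse_code("iomasach"))    # *M*S*K
```

Because both spellings map to the same key, a checker that indexes its word list by such keys can offer iomasach as a suggestion for imfhiosach even though the two strings differ by several letters.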
Installation
There is no "configure" script. After unpacking the tarball, run "make" to build the hashed word list (gaeilge.hash for ispell and gaeilge for aspell). If you run into trouble, edit the variables at the top of the Makefile so that they point to the appropriate directories and try again.
The ISO-639-1 code for Irish is "ga".
Alternate Models
The default hash table conforms strictly to standardized Irish spelling. You can generate either a "literary" or "dialect" model (ispell only) by changing the variable INSTALLATION at the top of the Makefile to gaeilgelit or gaeilgemor and using "make" as before.
The gaeilgelit model contains many obsolete or obscure (but standardly spelled) words which are probably best left out of any good Irish spellchecker. For instance, brúitíneach (a stumpy or stuffy person in Ó Dónaill) is a likely misspelling of the much more common word bruitíneach (the measles). Other typical "dangerous" word pairs: deirc for déirc, múid for muid, etc.
The gaeilgemor model, on the other hand, contains non-standard or dialect spellings (alongside the standard spellings) and accepts non-standard inflections of verbs. This greatly reduces its effectiveness as a spellchecking tool; indeed, anyone who uses non-standard forms so frequently that he or she finds the standard model inadequate will likely disagree with the very concept of an Irish spellchecker in the first place!
With all this in mind, I strongly urge installers to make the standard model the default on their systems.
Contributing
Only a handful of people have corresponded with me about these packages and no one has contributed any words. Perhaps having an appeal in English will inspire some learners to contribute.
As noted above, I have the software infrastructure in place so that if someone sends me a list of "headwords" with grammatical information, all inflected forms of the word will be added automatically. This is a much more efficient (and less error-prone) way of building up a good word list than by adding forms one by one into a personal dictionary.
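To give a rough idea of what "adding all inflected forms automatically" involves, here is a minimal Python sketch of just the initial mutations mentioned under Features (lenition, eclipsis, and the prefix "d'"). It is not the software described above: the function names are hypothetical, verb endings are not handled at all, and the rules are simplified textbook versions for lowercase words.

```python
# Simplified sketch of Irish initial mutations; not the author's generator.
VOWELS = set("aeiouáéíóú")
LENITABLE = set("bcdfgmpst")
ECLIPSIS = {"b": "mb", "c": "gc", "d": "nd", "f": "bhf",
            "g": "ng", "p": "bp", "t": "dt"}


def lenite(word):
    """Insert 'h' after a lenitable initial consonant."""
    first = word[0]
    if first not in LENITABLE or word[1:2] == "h":
        return word
    # Initial sc, sf, sm, sp, st are not lenited.
    if first == "s" and word[1:2] in {"c", "f", "m", "p", "t"}:
        return word
    return first + "h" + word[1:]


def eclipse(word):
    """Prefix the eclipsing consonant, or 'n-' before a vowel."""
    first = word[0]
    if first in VOWELS:
        return "n-" + word
    if first in ECLIPSIS:
        return ECLIPSIS[first] + word[1:]
    return word


def d_prefix(word):
    """Lenite, then add d' before a vowel or 'fh'."""
    lenited = lenite(word)
    if lenited[0] in VOWELS or lenited.startswith("fh"):
        return "d'" + lenited
    return lenited


def initial_mutations(word):
    """Return the set of initial-mutation variants of one form."""
    return {word, lenite(word), eclipse(word), d_prefix(word)}


print(sorted(initial_mutations("fuaimnigh")))
# ["bhfuaimnigh", "d'fhuaimnigh", 'fhuaimnigh', 'fuaimnigh']
```

A real generator would first produce every ending for the headword from its grammatical class and then apply only the mutations that are valid for each form, which is why a single headword plus a class label is a far more compact contribution than a list of individual forms.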
The word list already contains a large number of Irish personal names and placenames, but these were not added in any systematic way, and there may be quality issues. If anyone with a particular interest in this area would like to help, please get in touch with me at kscanne at gmail dot com.
© Copyright 2002-2007 Kevin P. Scannell