3.3. An assortment of less important files

The remainder of the files in the language pack require less attention, and some can be ignored entirely.

3.3.1. Segmentation

The grammar checker performs simple segmentation of the input text into sentences. You can customize this for your language by editing the files giorr-xx.txt and giorr-xx.pre. The default language pack uses statistical methods to extract likely abbreviations from a text corpus (i.e. words that appear almost exclusively followed by a period "."). You'll find these in giorr-xx.txt. You may also want to uncomment the lines in giorr-xx.pre so that one-letter abbreviations are escaped properly. Any other unusual end-of-sentence conventions should be encoded here as well.
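To illustrate the statistical idea described above, here is a minimal Python sketch. It is not the script used to build the default language pack; the function name likely_abbreviations and the thresholds min_count and min_ratio are assumptions chosen for illustration. It counts how often each word is immediately followed by a period and keeps the words for which that happens almost every time.

```python
import re
from collections import Counter


def likely_abbreviations(text, min_count=3, min_ratio=0.9):
    """Return words followed by "." in at least min_ratio of their occurrences.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from the language pack.
    """
    total = Counter()
    with_period = Counter()
    # Each match is a run of letters plus an optional period right after it.
    for word, dot in re.findall(r"([^\W\d_]+)(\.?)", text):
        total[word] += 1
        if dot:
            with_period[word] += 1
    # Ordinary words also end sentences now and then, so require both a
    # minimum frequency and a high proportion of period-final occurrences.
    return sorted(
        w for w, n in total.items()
        if n >= min_count and with_period[w] / n >= min_ratio
    )


sample = "Dr. Smith met Dr. Jones. Dr. Brown saw Smith and Jones and Brown."
print(likely_abbreviations(sample))
```

On a real corpus the output is only a list of candidates; it would still need to be reviewed by hand before being placed in giorr-xx.txt.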

3.3.2. Tokenization

You can specify how the grammar checker tokenizes the input stream by adding rules to the file token-xx.in. You can use this to deal with URLs, email addresses, monetary amounts, ordinals (1st, 2nd, ...), etc. in a clean, uniform way. The syntax of this file looks like:


The rules are applied in the order they are specified in the file. Applying a rule amounts to matching the regular expression globally in the input and surrounding the matched text with the specified tag. The regular expression will not match within or across already-recognized tokens, so you will want to give rules for longer, more complicated tokens like URLs first.
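The behaviour described above can be sketched in Python as follows. This is an illustration of the ordering and protection logic only, not the grammar checker's actual implementation, and it does not reproduce the syntax of token-xx.in. The function apply_rules, the tag names URL and NUM, and the XML-style output are all assumptions made for the example.

```python
import re


def apply_rules(text, rules):
    """Apply (tag, pattern) rules in order, protecting earlier matches.

    Hypothetical helper: tag names and XML-style output are assumptions.
    """
    # Each segment is (string, done); done=True marks an already-tagged token.
    segments = [(text, False)]
    for tag, pattern in rules:
        regex = re.compile(pattern)
        updated = []
        for chunk, done in segments:
            if done:
                # Already-recognized tokens are never searched again.
                updated.append((chunk, True))
                continue
            pos = 0
            for m in regex.finditer(chunk):
                if m.end() == m.start():
                    continue  # ignore empty matches
                if m.start() > pos:
                    updated.append((chunk[pos:m.start()], False))
                updated.append((f"<{tag}>{m.group(0)}</{tag}>", True))
                pos = m.end()
            if pos < len(chunk):
                updated.append((chunk[pos:], False))
        segments = updated
    return "".join(chunk for chunk, _ in segments)


rules = [("URL", r"https?://\S+"), ("NUM", r"\d+")]
print(apply_rules("See http://example.com/page2 for 3 items", rules))
```

Because the URL rule comes first, the digit in "page2" stays inside the URL token. If the NUM rule were listed first, it would claim that digit and the URL would be cut short, which is why the longer, more complicated tokens should come first.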

3.3.3. Files to leave alone