Duncan Temple Lang

Department of Statistics, UC Davis

Overview

Aspell is software that provides facilities for checking the spelling of words. Aspell is both a (command-line) application and a C library that can be used by other applications. This architecture enables us to dynamically load it into R and make its spell checking facilities available to R users. Why would this be useful? Firstly, one can use R to check spelling in general contexts, as one could in other general programming languages. Importantly, we as statisticians also increasingly deal with text as a data source or format and perform analysis on collections of documents. We might be interested in the number or pattern of mis-spelled words, and so we need to be able to identify them. Also, we work with help files and other forms of documentation and need the contents to be spelled correctly. This is typically left to other programs such as editors like emacs, but quality control and context-specific spelling can be done more naturally within R. In this case, spelling facilities are needed in R rather than R facilities being needed in the editor. One can of course use ESS or other tools to add context-specific information to the editor, but that is not our goal here. And lastly, it will be interesting to see if we can improve error messages by identifying mis-spelled variable names using a spell-checker. By making a spell checking facility available to R, perhaps others can improve the help facilities, error messages, and general interaction by trapping spelling mistakes and presenting intelligent, context-specific suggestions for alternatives.

The aspell library provides a C interface that can be used to check the spelling of words and the Aspell package contains the functions in R to access that interface. As with all packages, we must first load it to use its functions.
 library(Aspell)

The Basics

The work-horse function in the Aspell package for R is aspell (or simply spell). The simplest way to use this is to pass it a vector of words. The result is a logical vector of the same length as the vector of words, and each element is a logical value indicating whether the corresponding word was spelled correctly (TRUE) or not (FALSE).
 aspell("duncan")
The result is FALSE since the word is not correctly spelled; we need a capital 'D'. The next thing we might want to do is obtain a list of possible alternative spellings. The aspell function also does this for us if we ask it to suggest alternatives.
 
 aspell("duncan", TRUE)
On my machine (using an English-US (en_US) dictionary), I get the following values returned:
$duncan
 [1] "Duncan"   "dun can"  "dun-can"  "Dunc"     "Duncan's" "dungeon" 
 [7] "dunging"  "dunking"  "Deccan"   "Dunc's"   "Tuscan"   "cancan"  
[13] "uncanny"  "Tongan"   "Danica"   "Donica"   "dung"     "dunk"    
[19] "toucan"  
The suggested alternatives are ranked from most likely to least likely. One can control how aspell determines these possible alternatives, trading speed for accuracy and choosing between different types of spellers. We will return to this later.

For the record, if a word is spelled correctly and we ask for suggestions, that word will appear in the suggestions vector. It may be the first word, but one cannot guarantee that. For example,
aspell("aspell", suggests = TRUE)
$aspell
 [1] "Aspell"   "aspell"   "asp ell"  "asp-ell"  "Ispell"   "ispell"  
 [7] "spell"    "pspell"   "Aspell's" "Ascella"  "spill"    "Ispell's"

We can spell check multiple words in a single call.
 
 aspell(c("duncan", "temple", "lang"))
The result is
duncan temple   lang 
 FALSE   TRUE   TRUE 
indicating that the last two words were spelled correctly, but, as before, the first was not.
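
Because the result is a logical vector with one element per word, ordinary R subsetting picks out the words that failed the check. The following is a minimal sketch of that idea. The helper name misspelled, its check argument and the stand-in checker toyCheck are assumptions made for illustration and are not part of the Aspell package. Passing the checker as an argument lets the same code run with aspell or with any other function that returns one logical value per word.

```r
# Hypothetical helper (not part of Aspell): return the words that the
# supplied checker flags as mis-spelled. `check` is any function that maps
# a character vector to a logical vector of the same length, e.g. aspell.
misspelled <- function(words, check) {
  ok <- check(words)
  stopifnot(is.logical(ok), length(ok) == length(words))
  words[!ok]
}

# Stand-in checker used only so that this sketch runs without the aspell
# library installed; it accepts a small, fixed set of words.
toyCheck <- function(words) words %in% c("temple", "lang")

misspelled(c("duncan", "temple", "lang"), toyCheck)   # "duncan"
```

With the package loaded, misspelled(c("duncan", "temple", "lang"), aspell) should give the same answer, given the result shown above.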

And we can also get suggestions for a collection of words in a single call. For instance,
aspell(c("misestimation", "statistcs"), TRUE)
$misestimation
[1] "estimation"   "mastication"  "molestation"  "ministration" "misquotation"
[6] "menstruation"

$statistcs
[1] "statistics"  "statistic's" "statistic"   "statistical" "statics"    
[6] "stylistics"  "statics's"   "sadists"     "sadist's"   
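
Since the suggestions come back as a named list with the most likely alternative first, a common next step is to take the first element for each word. Here is a small sketch of that step. The helper firstSuggestion is hypothetical, and the list is typed in by hand from the output above (shortened to three entries per word) so that the example runs without the aspell library.

```r
# Suggestions copied from the output above, truncated to the first three
# entries per word to keep the example short.
sugg <- list(
  misestimation = c("estimation", "mastication", "molestation"),
  statistcs     = c("statistics", "statistic's", "statistic")
)

# Hypothetical helper: take the first (most likely) suggestion for each
# word, or NA when no suggestions were offered.
firstSuggestion <- function(suggestions) {
  vapply(suggestions,
         function(s) if (length(s) > 0) s[[1]] else NA_character_,
         character(1))
}

firstSuggestion(sugg)
```

The result is a character vector named by the original words, which is convenient for building a table of proposed corrections. Whether the first suggestion is actually the intended word still needs a human decision, as "misestimation" illustrates.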

Documents

The aspell function allows the user to check an individual word or vector of words. More commonly, we will have an entire document to spell check. Let's consider an example of an R help file, coming from this package in particular. We can use the Rd_parse function in the tools package to parse the aspell.Rd file:
els = Rd_parse(system.file("man", "aspell.Rd", package = "Aspell"))
From this, let's look at the description field from the file, els$data[3, 2]. Let's suppose we just want to collect the words that aspell believes are mis-spelled. As we encounter them, we will print them on the console. We can do this via the spell-checker as follows:
txt = els$data[3, 2]
spellDoc(txt, function(word, ...) { cat(word, "\n") })
As the spell-checker encounters each mis-spelled word, it invokes the function we gave it as the second argument. In our example, we merely print the value on the screen/console. One can do more interesting things such as prompting the user for an alternative spelling of the word, accompanied by possible suggestions from aspell. This is what the default handler for spellDoc does. For each word that the speller identifies as mis-spelled, the handler prompts the user with the collection of suggestions the speller provides. If the user selects one of these, the handler reports this "correction" back to the speller so that the speller can learn for future words or misspellings of the same word. If the user accepts the original word, the handler notifies the speller that this is a legitimate word and so it will not be signalled as mis-spelled in the future. Both of these feedback techniques can be turned off via the correct argument of the DocSpeller function, which creates the handler function.

We can also just collect the mis-spelled words across the calls and then retrieve them as a vector of words. The function collectWords is available to do this:
spellDoc(file(system.file("INSTALL", package = "Aspell")), collectWords())
One does not need the call to file here, but it removes any ambiguity: we want to spell check the contents of the file rather than the word that is the file name itself.
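
To make the idea of a collecting handler concrete, here is a sketch of how such a handler can be written as a closure. This is not the implementation of collectWords, and makeCollector and its two elements are names invented for this illustration. The handler has the function(word, ...) form used in the earlier spellDoc call, and we invoke it by hand to simulate the speller reporting words.

```r
# Hypothetical collector (a sketch, not the code behind collectWords).
# It returns a handler of the form function(word, ...) together with an
# accessor for the words gathered so far.
makeCollector <- function() {
  words <- character()
  list(
    handler = function(word, ...) {
      words <<- c(words, word)   # update the variable in the enclosing scope
      invisible(NULL)
    },
    words = function() words
  )
}

col <- makeCollector()
# Simulate the speller reporting two mis-spelled words.
col$handler("statistcs")
col$handler("duncn")
col$words()   # "statistcs" "duncn"
```

Each call to makeCollector creates a fresh, independent store, so separate documents can be checked with separate collectors. One would presumably pass col$handler as the second argument of spellDoc, but that use is an assumption and is not tested here.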

The checker argument of spellDoc is expected to be an object of class AspellDocChecker. By default, this is created each time we call spellDoc. However, one can create just one of these objects and reuse it across multiple documents or spelling of the same documents. The same applies to the speller and config arguments also.

The AspellSpeller class

If we just want to use this Aspell package to check the spelling of words, the aspell function as we have described it will do the job. Essentially, it consults a fixed dictionary and finds out whether the word is spelled correctly, and if not, finds the "best" alternatives. That sounds sufficient. However, when we are checking lots of words and want continuity between the individual checks, we need more control.

The last parameter of aspell is the speller object. By default, if this is not specified, a new AspellSpeller object is created. When the spelling is completed, that speller is discarded. As a result, there is no continuity between spelling operations. If we want there to be continuity, we need to use the same speller. To do this, we create the speller separately and then pass that in each call. We use getSpeller to get an instance of the speller:
 sp = getSpeller()

Given the speller, we can pass this to the different spelling functions.

We can create as many spellers as we want. This is useful if we want to use different settings for the different spellers, e.g. different dictionaries.

Each speller has options that control how it behaves. One can set these through the speller's configuration object, which is described under Configuration options.

Managing Words and Corrections

The aspell library, and consequently the Aspell package, provides facilities for educating the spell checker about what words should be considered correct and also for training it to offer better suggestions based on previous discoveries.

One can pass a collection of words to the speller that should be treated as correct. We do this via the function addToList. For example, suppose we want to recognize the words "omegahat" and "SJava" as legitimate for a particular speller instance. We would issue the command
sp = getSpeller()
addToList(c("omegahat", "SJava"), sp)
Now, when we use this particular speller object to check the words "omegahat" or "SJava", it will report that they are correctly spelled. If we use a different speller, on the other hand, that one will report that they are mis-spelled. In this way, we can maintain multiple, independent spellers that have different knowledge. We can do this easily in R by assigning the different spellers to different variables or elements of a list and using the appropriate one for the intended purpose.

Before spell checking a document, we can add any context-specific words that should be treated as correct but that would ordinarily be flagged as mis-spellings. For example, if we were checking the spelling of a help (Rd) file in R, we might first inform the speller that the names of the functions within the package are legitimate. We can do this with
 addToList(objects(2), sp)
assuming the package is loaded in the second place in the search path.

Additionally, we might add the names of the formal arguments/parameters for the functions being documented so that references to them in the text are not mistaken as odd words. Again, something like
 addToList(names(formals(func)), sp)
does the job. Of course, we have to use the same speller object (sp) when doing the spell-checking.
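
The two steps above (function names and argument names) can be combined into one vector of words before calling addToList. The following sketch uses only base R. The helper contextWords and the toy function spellFile are hypothetical and exist only for this example.

```r
# Hypothetical helper: given a named list of functions, return the function
# names together with the names of their formal arguments, without
# duplicates and without "...".
contextWords <- function(funs) {
  args <- unlist(lapply(funs, function(f) names(formals(f))),
                 use.names = FALSE)
  setdiff(c(names(funs), args), "...")
}

# A toy function standing in for one documented in a package.
spellFile <- function(filename, suggests = FALSE, ...) NULL

contextWords(list(spellFile = spellFile))
# "spellFile" "filename" "suggests"
```

With a speller sp in hand, addToList(contextWords(list(spellFile = spellFile)), sp) would then register all of these words in one call.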

A possibly more convenient interface for adding to the list of correctly spelled words is to use the $ operator for an AspellSpeller object. We have provided overloaded methods for speller objects and one of them is the session "field". We can use this as
 
  sp$session  = c("omegahat", "SJava")
This tells R to add the words on the right hand side of the assignment to the session word list of correctly spelled words. This is merely a call to addToList and there is no difference other than convenience.

If one looks at the potential arguments for addToList, one sees that there is a session parameter. This is a logical value that controls whether the words are added to the speller's session list of words, or alternatively to the speller's personal list of words. These two lists are maintained separately. The essential difference is that we can save the personal list back to a central file when we wish, whereas the session word list will go away when the speller disappears. So this interface allows us to differentiate between types of words and their future use for spelling.

The session word list can be retrieved at any moment using the function getWordList. This allows us to manage the lists ourselves rather than relying on the personal words file, etc. In addition to the session and personal lists, there is also the main word list for a speller. Any of these three lists can be retrieved by specifying the names of interest as the second argument to getWordList. For example, to get the session and personal lists, we use
 getWordList(sp, c("session", "personal"))

We next turn our attention to how we can tell the speller about words that were mis-spelled and what the correct version should be. The benefit of this is that we are providing feedback to the speller so that it can use this in future spell checking to provide better suggestions. We add pairs of mis-spelled and corrected words to the speller using the addCorrection function. The first argument is the speller object. When used interactively, i.e. at the command line, it is easiest to then provide the pairs directly as named arguments, with the mis-spelled word as the name and the correct word as the value. For example,
addCorrection(sp, duncn = "duncan",  ro = "rho", statistcs = "statistics")
This tells the speller that it should suggest "duncan" when it finds the mis-spelled word "duncn", and similarly "ro" should lead to "rho". We can test this with
sp$spell("ro")
$ro
 [1] "rho" "RI"  "Rio" "Roi" "Row" "Roy" "roe" "row" "OR"  "or"  "R"   "r"  
[13] "ROM" "RP"  "Rob" "Rod" "Rog" "Rom" "Ron" "Ros" "Roz" "rob" "rod" "rot"
[25] "to"  "O"   "o"   "RR"  "Ra"  "Re"  "Rh"  "Ru"  "Ry"  "re"  "PRO" "SRO"
[37] "bro" "fro" "pro" "RC"  "RD"  "RF"  "RN"  "RV"  "Rb"  "Rd"  "Rf"  "Rn" 
[49] "Rx"  "rd"  "rm"  "rs"  "rt"  "BO"  "Bo"  "CO"  "Co"  "Ho"  "Io"  "Jo" 
[61] "KO"  "MO"  "Mo"  "No"  "PO"  "Po"  "SO"  "co"  "do"  "go"  "ho"  "lo" 
[73] "mo"  "no"  "so"  "yo"  "R's"
and, as expected, "rho" comes first.

We can save the word lists to their appropriate files using saveWordLists or sp$save(). This causes aspell to update the user's files with the information in the current word lists.

Configuration options

The aspell library is quite customizable: one can control which dictionaries it uses, how it provides suggestions for alternative spellings/words, how it deals with words that might be run together, the encoding of characters, and so on. While aspell reads the settings from site and user files, we can also specify these values dynamically within an R session.

Each speller has its own configuration set. This allows us to have different spellers for different purposes. For example, we may have one speller for English and another for US English. These sorts of options are controlled by the speller's configuration. Given a speller, we can get its configuration object using either of the commands
 conf = sp$conf
or
 conf = getConfig(sp)
The resulting object is of class AspellConfig. We can treat this as if it were a list in R. Each of the options it supports can be thought of as an element in the list. We can find the names of the elements/options using the names function:
names(conf)
 [1] "actual-dict-dir"      "actual-lang"          "affix-char"          
 [4] "affix-compress"       "backup"               "byte-offsets"        
 [7] "clean-affixes"        "clean-words"          "conf"                
[10] "conf-dir"             "conf-path"            "data-dir"            
[13] "dict-alias"           "dict-dir"             "encoding"            
[16] "extra-dicts"          "filter"               "filter-path"         
[19] "guess"                "home-dir"             "ignore"              
[22] "ignore-accents"       "ignore-case"          "ignore-repl"         
[25] "invisible-soundslike" "jargon"               "keyboard"            
[28] "keymapping"           "lang"                 "language-tag"        
[31] "local-data-dir"       "master"               "master-flags"        
[34] "master-path"          "mode"                 "module"              
[37] "module-search-order"  "norm-form"            "norm-required"       
[40] "norm-strict"          "normalize"            "partially-expand"    
[43] "per-conf"             "per-conf-path"        "personal"            
[46] "personal-path"        "prefix"               "repl"                
[49] "repl-path"            "reverse"              "run-together"        
[52] "run-together-limit"   "run-together-min"     "save-repl"           
[55] "set-prefix"           "size"                 "skip-invalid-words"  
[58] "spelling"             "sug-edit-dist"        "sug-mode"            
[61] "sug-repl-table"       "sug-split-char"       "sug-typo-analysis"   
[64] "suggest"              "time"                 "use-other-dicts"     
[67] "validate-affixes"     "validate-words"       "variety"             
[70] "warn"                 "word-list-path"      
This shows that there are currently 71 different options. Each of the different options has one of 4 basic types of values. These are integer, string, boolean and list. (There are also others such as file, but these are just types of strings and are not dealt with separately in the aspell library.) We can find out more about the options and their types using getSpellInfo. This returns a description of class KeyInfo for each element in the configuration object. Each KeyInfo object has the name of the option being described. It also provides the type of the acceptable value. This type information is given as a named scalar integer. The name gives the human-readable form and is one of the 4 basic types. Each information object also provides the default value in the def slot. And, importantly, the desc slot in most objects gives a brief description of what the option controls. The remaining two fields in the object are not of interest to us at this point.

The important thing about the configuration objects is that we can both get and set values. The function getSpellConfig is the underlying mechanism to query a value. Given an AspellConfig object, however, we can use the short-hand form via the [ and $ operators. If we want to fetch the value of one element in the configuration, we can use $, as in
 conf$lang
or
 conf$"data-dir"
Note that we have to quote names with a '-' in them. This form returns the value for that option.

If we want to get the values for multiple options in a single call, we can use [. For example,
conf[c("mode", "filter", "lang")]
returns a vector containing the three values
   mode  filter    lang 
  "url"   "url" "en_US" 
If the values are of different types (e.g. an integer and a string), the results are coerced to a common type. This is the result of a call to sapply. For example,
 conf[c("run-together-min", "filter", "lang")]
yields
run-together-min           filter             lang 
             "3"            "url"          "en_US" 
yet the "run-together-min" option is an integer.

On the other hand, if the values are of different lengths (e.g. one is a list of strings and another an integer), the result is a list. For example, if "filter" is a list with two elements, then
conf[c("run-together-min", "filter", "lang")]
produces
$"run-together-min"
[1] 3

$filter
[1] "email" "html" 

$lang
[1] "en_US"
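
The two behaviours described above are those of sapply in base R, and they can be reproduced without a configuration object. In this sketch the option values are typed in by hand as stand-ins; they are not read from a real AspellConfig, and identity is used in place of whatever function the package applies to each option.

```r
# Stand-in option values (not read from a real AspellConfig object).
sameLength <- list("run-together-min" = 3L, filter = "url", lang = "en_US")
diffLength <- list("run-together-min" = 3L, filter = c("email", "html"),
                   lang = "en_US")

# Every value has length one, so sapply simplifies to a character vector
# and the integer 3 becomes the string "3".
sapply(sameLength, identity)

# The lengths differ, so sapply cannot simplify and returns a list, in
# which the integer keeps its type.
sapply(diffLength, identity)
```

If the original types matter, it is therefore safer to fetch such options one at a time with $ than to rely on the simplified vector.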

Setting Options

Just as we can get the values of options in the AspellConfig objects, we can also set values for options. Analogous to getSpellConfig, we use setSpellConfig. This takes a configuration object and then a collection of name-value pairs giving the option name and its new value. The function takes care of coercing logicals to the appropriate form (e.g. FALSE to "false"). It also handles assigning multiple values to list type options. Thus, we can use it as
 setSpellConfig(conf, warn = FALSE, lang = "en", filter = c("email", "html"))
Then, the command
conf[c("warn", "lang", "filter")]
gives
$warn
[1] FALSE

$lang
[1] "en"

$filter
[1] "email" "html" 
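
To illustrate the kind of coercion mentioned above (e.g. FALSE to "false"), here is a sketch of such a conversion in base R. It is not the code used by setSpellConfig; the helper toConfigString is hypothetical and only shows one plausible way to turn R values into the strings a text-based configuration expects.

```r
# Hypothetical helper (not the package's implementation): convert an R
# value to character form. Logicals become "true"/"false"; all other
# values go through as.character unchanged.
toConfigString <- function(value) {
  if (is.logical(value)) tolower(as.character(value)) else as.character(value)
}

toConfigString(FALSE)                # "false"
toConfigString(c("email", "html"))   # "email" "html"
toConfigString(3L)                   # "3"
```

A vector with several elements stays a vector, so a list-type option such as "filter" could then be handled by setting its elements one after another.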

Just as we used the $ operator to access individual values, we can also assign a value to an individual option using $<-. For example, we can set the "warn" option to FALSE via
conf$warn = FALSE
This approach also understands multiple values.
 conf$filter = c("email", "html")
As one would expect, we can add to an existing list by including its current value in the assignment:
 conf$filter = c("email", "html", conf$filter)
The [<- operator does not work in this context.

The $<- mechanism uses setSpellConfig to perform its task, and setSpellConfig hides the details of coercing the values and setting multiple values in a list. However, if one wants to, one can do this directly by putting the "add-", "clear-" and "rem-" prefixes in front of the option name, e.g. "add-filter" for the filter option.