Class Wrapper #2: Aspell
At work, I maintain a system in charge of aggregating hundreds of thousands of data records on a nightly basis. Some of this data comes from XML, some from CSV, and some from clients who ask us to crawl their site to retrieve their data. The system works really well (although aging badly).
There’s one problem, though: the data, no matter what form it takes when it hits our system, all comes from people. I don’t like people, because they make mistakes. They misspell. They omit some things, and add too much elsewhere. They don’t know how to normalize data sets.
“Quit complaining,” you mumble. “That’s your job!”
Quite right. For about a year now, the sum total of my issues with this system has centered on reconciling poorly formed data. Most of the time, the data I have to clean up traces back to a misspelling, or a mistyping, or something of the sort. And when I say “most of the time,” I mean most of the time. I’d wager easily 90%, or more. Now that I’m at the office full time again (I’m back from school for the summer), I’ve been asked to shift my focus almost exclusively to the clarity and accuracy of our database. Prior to this, I’ve done everything from system and network administration, to office tech support, to even some customer support.
Happy to oblige (no more constant distractions), I had time to sit back for a couple of hours and survey our technology landscape. I quickly realized that spelling, typing, and translation errors constituted most of our woes.
Enter GNU Aspell, recommended to me about a year ago by a great guy, one of the creators of Birdstack. I was promptly distracted from looking into Aspell, and the idea hung motionless in my subconscious “to do” list until earlier this week.
Designed as a replacement for Ispell (an old Unix command for multilingual spell checking), Aspell’s major advantage over the alternatives is that it does a great job of suggesting possible replacement words for misspellings. Add to that support for multiple custom dictionaries, as well as personal and session-based dictionaries, and Aspell (with its C language binding) makes the perfect choice for domain-specific text correction.
The C binding is very easy to use (see example). You create a config object, you create a spellchecker object using the config object, and then you go to town, check()ing whether or not words are in the dictionary, and optionally returning linked lists of correction suggest()ions.
At first, I attempted to home-grow a solution to the spelling situation. It didn’t seem like a large enough problem to warrant inclusion of a third party system. Turns out, it is. In my research, I discovered two promising algorithms for detecting misspellings and ranking suggestions. One was the old-fashioned Russell Soundex and its derivatives and cousins, such as Double Metaphone and NYSIIS. These algorithms attempt to capture the phonetic fingerprint of a word in a short sequence of characters. Thus, words that sound similar will have similar or matching soundex codes.
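To make the idea concrete, here is a minimal sketch of the classic American Soundex in C++. It is an illustration only: it is not Aspell’s code and not part of the wrapper, and the names soundex() and soundex_digit() are made up for this example. It assumes plain ASCII input and skips any non-letter characters.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: Soundex digit for a letter.
// Returns '0' for vowels (and y), '-' for h and w, which get special treatment.
char soundex_digit(char c) {
    switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'b': case 'f': case 'p': case 'v': return '1';
        case 'c': case 'g': case 'j': case 'k':
        case 'q': case 's': case 'x': case 'z': return '2';
        case 'd': case 't': return '3';
        case 'l': return '4';
        case 'm': case 'n': return '5';
        case 'r': return '6';
        case 'h': case 'w': return '-';
        default: return '0';
    }
}

// Hypothetical helper: four-character Soundex code (letter + three digits).
// Returns an empty string if the input contains no letters.
std::string soundex(const std::string& word) {
    std::string code;
    char prev = '0';  // digit of the previously coded letter
    for (char c : word) {
        if (!std::isalpha(static_cast<unsigned char>(c))) continue;
        const char d = soundex_digit(c);
        if (code.empty()) {
            // Keep the first letter as-is; remember its digit so that an
            // immediately repeated sound is not coded twice.
            code += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            prev = d;
            continue;
        }
        if (d == '-') continue;                 // h/w do not separate equal codes
        if (d != '0' && d != prev) code += d;   // new consonant sound
        prev = d;                               // a vowel resets prev to '0'
        if (code.size() == 4) break;
    }
    if (!code.empty()) code.append(4 - code.size(), '0');  // pad with zeros
    return code;
}
```

With this sketch, soundex("Robert") and soundex("Rupert") both come out as R163, which is exactly the “sounds similar, codes match” behavior described above.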
The other was the Levenshtein algorithm. If you’ve ever done a word ladder puzzle, you’re already familiar with the idea. Taking two strings, this clever algorithm compares them side by side to calculate what’s called the “edit distance,” that is, the minimum number of operations (add a letter, delete a letter, change a letter from one thing to another) required to transform one string into the other. The lower the edit distance, the more similarly spelled the two words are.
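As a rough sketch of how that calculation can be done, here is the usual dynamic-programming version in C++, keeping only two rows of the table in memory. The function name levenshtein() is made up for this example; it compares bytes, so it assumes single-byte (ASCII) text.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    // prev[j] holds the distance between the first i-1 chars of a and the
    // first j chars of b; curr[j] is the same for the first i chars of a.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> b[0..j)

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // a[0..i) -> "" takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // delete from a
                                curr[j - 1] + 1,      // insert into a
                                prev[j - 1] + cost}); // substitute (or match)
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

For instance, levenshtein("kitten", "sitting") is 3: change k to s, change e to i, and add a g on the end.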
Aspell is brilliant in that it precalculates the soundex of each word in its dictionary files, and then at runtime compares the edit distances of the soundexes to form the basis of its suggestion algorithm. Beautifully elegant. Here’s my wrapper. It’s still pretty early in development, so I’ll probably be updating it a few times over the next couple of weeks.
Header: speller.h
Implementation: speller.cpp
Test driver: speller_main.cpp
It’s as straightforward to use as I could make it. Create a Speller object, config()ure it to your preference, then init()ialize it. The extra init() call, which I normally try to avoid, was necessary because of the way the underlying AspellSpeller object is built from the AspellConfig object: the configuration has to be complete before the speller is created. The alternative was to roll a two-class design, which seemed even less usable.
You’ll need to have installed the Aspell development package to use this, and then link with -laspell. Cheers!