Discussion Closed Temporarily

June 15th, 2009 Chris No comments

I’m in the process of moving this blog to a new web host this morning, so until the DNS records propagate fully, no new content will be allowed (as per the steps outlined in this post).

Should be back up in a day or two.

Update: We’re back up, folks.

eval( “round 2” )

June 12th, 2009 Chris 1 comment

In my previous post entitled Don’t be lazy. Don’t use eval(), I outlined a few reasons why eval() is a bad (if really, really cool) function to use.  So bad for programmers (both practically and rhetorically), in fact, that my thesis has become to forget its very existence.  From that article:

we are spoiled.  All of us.  We are lulled into a false sense of security by believing that throwing more (and better) hardware at a problem is a sufficient excuse for writing poor code, shoddy algorithms, and overall paying less attention to detail.  Don’t get me wrong, Jeff’s argument is carefully thought out and well presented.  But I take it with a grain of salt.  In fact, lots of salt.  Don’t use fast hardware as an excuse to be lazy.  Where does the habit of eschewing proper paradigm and using your hardware as a crutch stop?

There are several points in that paragraph I think are worth noting.

  1. Yes, programmer time is expensive, and hardware is cheap.  There is a lot (a lot) of store to be set by simply getting job X done quickly.  And, in certain circumstances, more hardware is the way to go.  Some problems can’t be solved by reducing algorithmic complexity and micro-optimizing code at the assembly level – some problems, by nature, require scale.  But that doesn’t mean that we can abandon complexity theory, cast memory consumption to the wind, and start blowing processor cycles on pointless memory lookups for every program we generate.
  2. eval(), in any language, is a security risk.  Perhaps that risk is minimized when running in a sandbox, but there is still risk.  And if someone is determined enough to break something, they’re going to do it anyway.  The old adage “locks only keep honest men honest” applies here.  Why make it that much easier?  PHP and JavaScript can’t sandbox though, so you’d be living in a fool’s paradise to use eval() in those places.  In Python, sure, you’ve got your “safe zone”.  But, as several readers have pointed out, there are still ways to exploit the system.  A precaution against this, of course, is to scrub the input before running eval().  Ensure that all identifiers strung together in your to-be-eval()’d code are valid tokens in their own right.  But – at that point, you’re spending so much time worrying about hand-holding the code going through eval() that you might as well have just programmed it yourself using iteration, function pointers, callbacks, and the like.  The argument at the heart of the eval() security debate lies along the same lines as “blacklisting” versus “whitelisting”.  Cleansing eval() input is blacklisting.  Choosing an alternative method of doing the same thing without eval() is whitelisting.  This post traveled through a number of firewalls to reach your eyes.  Firewalls are whitelists.  Symantec is a blacklist.  Which fail more frequently, firewalls or antivirus programs?
  3. eval() can evoke massive laziness.  There have been about three times I can recall considering its use (yes, against my own thesis).  Each time, I sat back, and thought about an alternative solution.  Each time, I was able to envision a module designed without eval() – and the said alternative designs were always more architecturally sound than the code produced when eval() was thrown in the mix.  “But it still would have been quicker to use eval()” you might say.  And you’d be right.  In the short term.  Overall, maintenance would be more difficult, readability and reusability suffer, bugs become more difficult to track (even with unit testing) and you’ve done neither yourself, nor anyone else, any favors in taking the cheap route.
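To make the whitelisting point in item 2 a little more concrete, here is a minimal Python sketch of a lookup table standing in for eval().  The operation names and the apply_operation() helper are made up for illustration; the only idea taken from the post is that the caller picks from a fixed set of known-good callables instead of handing over a string of code.

```python
import operator

# Hypothetical whitelist: the only operations a caller may request.
ALLOWED_OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}


def apply_operation(name, left, right):
    """Look the operation up by name instead of eval()-ing a string."""
    try:
        func = ALLOWED_OPERATIONS[name]
    except KeyError:
        # Anything not on the list is rejected outright; nothing is executed.
        raise ValueError(f"unsupported operation: {name!r}") from None
    return func(left, right)


print(apply_operation("mul", 6, 7))  # 42
```

Anything that isn’t a key in the table never gets near an interpreter, which is the whole appeal: there is nothing to scrub.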

All that having been said, one reader noted that without eval(), Firebug (a web development plugin for Firefox without which I’d be lost) would not be possible.  I disagree with that statement.  I use Firebug daily, and I’ve used the console to run code on the fly a grand total of two times – and one of them was just for the sake of trying it.  Firebug presents such advanced reporting (in DOM traversal, scripting errors, and traffic analysis) that I simply don’t use the console in that fashion – and I don’t see how anyone else would need to rely on it.  I understand that it’s widely used – my point is that it’s in no way necessary to practically debug web apps, or to rhetorically advance the state of the art of programming, metaprogramming, et al.  However, it’s a widely used feature by many fine folk I’d consider top-notch developers, so there you have it.

In class, Dr. Parson presented us with a direct example of how eval() is used in some of his upper level classes.  He said that eval() is used to generate functions which get attached to classes (and thus become members) at runtime, saving the developer many pointless repetitive keystrokes writing banal code.  I agree – eval() can have a purpose here.  But not in the production system.  No – a more responsible use of eval() would be in a developer script to generate these methods in a blank file, and then copy/paste the code into the class definition.  You get the same effect: developers hitting fewer keys (hear, hear!) but you avoid the pitfalls of having eval() run every time your application fires up.
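As a rough illustration of that “developer script” idea, here is a Python sketch that prints repetitive accessor methods as source text, ready to be pasted into a class body.  The render_getters() helper, the field names, and the get_/underscore naming scheme are all assumptions of mine, not anything from Dr. Parson’s example.

```python
def render_getters(field_names):
    """Return source text for one getter method per field name.

    The output is meant to be pasted under a class statement by hand;
    nothing here is evaluated at application runtime.
    """
    lines = []
    for name in field_names:
        if not name.isidentifier():
            raise ValueError(f"not a valid identifier: {name!r}")
        lines.append(f"    def get_{name}(self):")
        lines.append(f"        return self._{name}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical field names, purely for demonstration.
print(render_getters(["title", "author"]))
```

The generated text lands in version control like any other code, so it can be read, reviewed, and debugged without a string ever passing through eval().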

Those who claim that Python (and PHP, JavaScript, et al.) are themselves giant eval()s are absolutely correct in their assertion; however, they draw illegitimate conclusions from the fact.  That premise isn’t sufficient ground to argue that eval() is just dandy.  This type of faux logic is tantamount to replacing the driver’s seat of your sedan with a Go-Kart – you could make a hundred arguments in favor of doing so, but at the end of the day, it just doesn’t make sense.

For those of you who caught my minor rhetorical shift there, you’ll note that I’m essentially qualifying my thesis with the following: in the final product.  Developers create throw-away one-time “scriptlets” all the time to get quick and dirty jobs done.  I go through several per day, just in my quest to have the computer do what it’s better than me at (which is iteratively following rules).  Go ahead – instantiate a Knife object, invent some code involving butter, toss it to eval(), slam that in a loop, and run it until your process reaches its virtual memory limit* – but if that type of thing appears in your final product (the code that matters to everyone else), then I submit that you could be doing a better job.  Perhaps you aren’t doing a bad job, but you could be doing better.  Maybe I’m a perfectionist, maybe I’m just a little loopy, but whenever I get the feeling that I could be doing something better, it bugs me until I go back and address it.

*This reference comes from an absolutely hilarious quip Dr. Parson made at the end of his rebuttal of the original post.

Failing Forward

June 11th, 2009 Chris 2 comments

Due to issues with my current registrar (1&1) it looks like I may not be moving the site this week after all.  That detail remains to be clarified.  However, in preparing to move the site, I realized just how inappropriate and… downright stuffy the title “Programming with Poise” was.  I pondered for only a few minutes before discovering what I think to be a perfect fit.  Lately, I’ve been thinking a lot about the concept/mindset of “failing forward” – and it’s a catchy phrase, too.

On failing forward: We all screw up.  Sometimes it’s big, sometimes it’s little.  But it does happen.  What makes the difference in {your job, your relationship, your newest project, your diet, whatever} is how you handle that failure.  Do you run and hide?  Ignore it?  Point the finger?  Shrug it off?  Strong-arm your way past it?  Or do you acknowledge the failure, embrace the consequences, learn from the experience, and save the little pearl of wisdom gained for the “next time”?

Relocation

June 6th, 2009 Chris 2 comments

Within the next week or so, I plan on migrating this site to a different host.  Here’s my plan, so as never to experience any downtime – I hope the procedure proves helpful for someone, and if you’ve got any thoughts on how to more smoothly achieve the same result, I’d love to hear from you.

  1. Update the DNS TTL to a small value (I’ll choose 1 hour).
  2. Configure the new web host, and copy the website itself to its new home.
  3. Disable commenting on both sites.
  4. Copy the old database content to the new host.
  5. Update the DNS records to point to the new host.
  6. After about 24 hours, take the old site down.
  7. Enable commenting on the new site.
  8. Change the DNS TTL back to something more reasonable (say, 1 week).

In this fashion, I’m able to transfer the site to an entirely new web host, and the only interruption in service will be the disabling of comments – but all of my content remains available.  If I’m not generating any new content (a given) and neither are my readers (given that I’ve frozen discussion), then I can have two copies of the site running simultaneously with no negative ramifications.  When I’m certain that there are no outstanding stale DNS records, I remove the old site, and allow discussion again.  Voilà.

What does your nail look like?

June 4th, 2009 Chris No comments

In a previous post, I outlined a habit I have of using, over-using, and downright abusing “new” tools that I learn for a while – until their novelty wears off and I start to learn the line between when it is, and when it is not, appropriate to use them.  Here’s the gist:

I call this the New Solution Syndrome.  The person has been given a new tool.  A new way of doing things.  A solution in search of a problem.  People (when subject to the New Solution Syndrome) will then go, and try to find problems to solve with their new tool, even if it isn’t necessarily the quickest, cheapest, or more significantly, the best tool to use for the job.

Posted to that article was a particularly concise comment:

Think of it this way: you used the hammer so much in the right and wrong situations that you developed a very accurate picture of what a nail is supposed to look like.

Reflecting on this, I had one of those coding “poof” moments.  One of those thoughts that you – and every mother’s son what calls himself a coder – has thought thousands of times before.  Sometimes, though, I think we have these thoughts without ever… truly… grokking the real concepts at hand.  Say you have some sort of tool (e.g. a hammer) to choose for each of the following:

  • Before you go to create a repository for that new project, figure out what the nail looks like.
  • Before you go to author a brand-new file, figure out what the nail looks like.
  • Before you write another function/method, figure out what the nail looks like.
  • Before your mind starts wandering about how best to re-write that bottlenecked bit of code, figure out what the nail looks like.
  • Before anyone can cloud your vision and poison your thoughts with their own opinions about architecture, figure out what the nail looks like.
  • Don’t even think about changing that server config file before you figure out what the nail looks like.

“Duh,” “obvious,” and “what does this guy take me for?” are all responses I would expect you to be thinking at the moment.  But take thirty seconds, and think about some recent software development faux-pas.  I don’t mean general coding blunders – a seg fault here, an array index out of bounds there, a few NullPointerExceptions and a nice infinitely recursing function – those are the minutiae of our work.  I want you to think of system design flaws, overly complex interfaces, major scaling problems, or a case of software simply not meeting its spec.

Why do these things happen?  [And they do happen, in the real world.]

Because someone, somewhere down the line, sat down and grabbed a hammer before they really knew what their nail looked like.  Maybe they sat down and spilt too much rhetoric at a board meeting, maybe they stood and drew inaccurate pictures on a whiteboard, maybe they sent that usability report off in too much of a hurry, or maybe they chose a language based on mere whim and personal preference.  It doesn’t matter.  That person didn’t take the time to fully comprehend the task at hand, and now there are issues.

If that person was you, then it’s your fault for failing to look before you leapt.  Accept that, move on, and pay attention next time.

If that person was a superior, then it’s your fault for letting someone else think for you.  Accept that, move on, and pay attention next time.

If that person was an inferior, then it’s your fault for leading them astray.  Accept that, move on, and pay attention next time.

At the risk of digressing into a societal flame war here, I need to say that people are far too concerned about shifting blame, and about appointing credit, and not nearly concerned enough about getting the job done right.  Coding well is as much about taking responsibility and ownership, and about making informed decisions, as it is about writing code.  One of the easiest areas to start this habit is in choosing which tool(s) to use for a given job.  Of course – you can’t know what kind of hammer you need until you figure out what the nail looks like.

Google Wave

May 29th, 2009 Chris No comments

In a previous post, I outlined the slow but sure initiative on behalf of Google to step properly into the realm of social networking.

As it turns out, just the other day at their I/O conference, Google unveiled their big guns in the social space – Google Wave.

Instantly blurring the aging lines between Email and Instant Messaging, and at once meshing the capabilities of wildly popular platforms like AIM, Flickr, Facebook, and Moodle, Google Wave has me floored.  Messages are both real-time and persistent, and everything is versioned (which allows a newcomer to any particular conversation to “replay” the entire sequence of modifications).  Adding photos (and I’d imagine some other types of files) from your local machine to a wave is as simple as drag-n-drop.  As conversations diverge and branch, and different people become active in different ways, Wave allows highly customizable, fine-grained controls over each and every such branch.  Collaboration on wave-d text happens inline, in real time, and can be distributed (and all is “replay”able).

The service requires no special browser plug-ins or client applications – just an active account, and a standards-compliant browser (taking heavy advantage of the provisions in the HTML5 spec).  The Wave Federation Protocol is an Open Standard, and Google’s Wave Interface will be made Open Source (the API is already published – but still in flux).  There is a rich extension and plug-in environment on both client and server sides.  Google has already written several, including integration points with Google Maps, Twitter, and even – you might have to re-read this once or twice – dynamic, real-time language translation.

Whatever Google says (or doesn’t say), I can just smell this interface being the preferred integration point among products like Gmail, Reader, Calendar, et al. – even though there were no allusions to that being the case in the video.  I would very highly recommend – if you’ve got an hour and twenty minutes to spare or not – watching the full presentation to get the whole story (can be found on the product home page).  I’ll be waiting rather impatiently for a chance to sign up.  Kudos to the Google Maps brothers Lars and Jens Rasmussen!

Information Retrieval

May 27th, 2009 Chris No comments

Information retrieval. It’s a high-brow word for search. I’ve actually begun breathing new life into an old search-related project in the past few days, and upon reading What 255 characters looks like, a particular section caught my eye.

It’s true that in most cases it won’t make a difference. However, if you need to index and search the field, you should think carefully before blindly using TEXT. The data in TEXT type fields are stored outside the table itself, using only a few bytes for pointer information. This means that TEXT fields are not indexed, while VARCHAR fields are. This can have a tremendous effect on your SQL query speeds, as generally larger TEXT fields increase query time exponentially. Even if we take indexing out of the picture, the external storage of TEXT fields means that you’ll still see generally faster searches with VARCHAR.

Frank here presents nothing but a rational, fact-based explanation of the biggest [performance] differences between VARCHAR and TEXT fields in MySQL.  But it raises a rather distinct question, in my mind.  What do you do if you have too much data for a VARCHAR field, but require fast searching nonetheless?  My first reaction is not to solve a problem that’s already been solved.  There are third party applications, such as Sphinx, that do a wonderful job of indexing content.  After all, Jeff Atwood makes this point very clearly in his perfectly-named post Don’t Reinvent The Wheel, Unless You Plan on Learning More About Wheels.

Done and done.  Problem solved.  But… what if you really do want to just learn more about wheels (er, information retrieval)?  Firstly, I would highly recommend reading both Frank and Jeff’s posts.  You should, then, have a rough idea of where I’m coming from to NOT use a Sphinx-esque product for my new venture.  That all said, let’s get started.  We’ll piggyback on Frank’s use of email searching throughout our examples.

We’ll create a table to store emails, and keep it simple, for the sake of example.  I’m using MySQL 5, and I like UTF-8.

CREATE TABLE `foobar`.`emails` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`subject` VARCHAR( 255 ) NOT NULL ,
`body` TEXT NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci

Here we have a primary key field, a subject (VARCHAR) and a body (TEXT).

Note: In Information Retrieval (“IR”) this is called the forward index.  If you know the ID of an email, you can get all the tokens in that email.

Let’s assume you wanted to search all emails for the word “fortran”.  In a plain vanilla world, your query might look something like

SELECT * FROM emails WHERE body LIKE '%fortran%' OR subject LIKE '%fortran%';

This means that MySQL must fetch, load, and scan nearly every character of every email in the table.  If you’ve got three messages, no big deal.  My Gmail inbox, however, at the moment is around 8000 conversations – about 570MB.  That would take a while.  The alternative is to put effort into doing all this scanning ahead of time.  And that’s the essence of the term “indexing”.

We’ll create a script (or program, or stored procedure, etc) that goes through, and finds all the words ahead of time for us.  And then we use a level of indirection at search-time, to get our information.  Take the following three emails:

Email #1

Subject: programming class

Body: I have to write a fortran program for my computer science class.

Email #2

Subject: programming class

Body: Have you ever written fortran before?

Email #3

Subject: spam

Body: I will try to trick you into clicking these links.

Our script will tokenize each text field of interest, compile a list of all unique tokens, and store them in a database table.  A table, for starters, something like this

CREATE TABLE `foobar`.`inverted_index` (
`token` VARCHAR( 255 ) NOT NULL ,
`emails` TEXT NOT NULL ,
UNIQUE (
`token`
)
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci

For our three sample emails, the unique tokens are the set: { a, before, class, clicking, computer, ever, for, fortran, have, i, into, links, my, program, programming, science, spam, these, to, trick, try, will, write, written, you }

So, we insert these tokens, one by one.  We then update the table – for each token, we find the set of emails in which that token appears, and store their IDs.  For example, the row for token “fortran” would look like: “fortran” => {1,2}.  Another row might be “to” => {1,3}, or “written” => {2}.  We store all of these in the database table called “inverted_index”.  It’s called the inverted index because, conversely from the forward index, given a token, we know all emails containing that token.
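Here’s a small Python sketch of that indexing pass, run against the three sample emails and kept in memory rather than written to the inverted_index table.  The tokenizer (lowercase, split on anything that isn’t a letter or digit) and the function names are my own assumptions; a real script would need to decide how to treat punctuation, stop words, and so on.

```python
import re
from collections import defaultdict

# The three sample emails, keyed by ID: (subject, body).
EMAILS = {
    1: ("programming class",
        "I have to write a fortran program for my computer science class."),
    2: ("programming class", "Have you ever written fortran before?"),
    3: ("spam", "I will try to trick you into clicking these links."),
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(emails):
    """Map each unique token to the set of email IDs that contain it."""
    index = defaultdict(set)
    for email_id, (subject, body) in emails.items():
        for token in tokenize(subject) + tokenize(body):
            index[token].add(email_id)
    return dict(index)


index = build_inverted_index(EMAILS)
print(sorted(index["fortran"]))  # [1, 2]
print(sorted(index["to"]))       # [1, 3]
print(sorted(index["written"]))  # [2]
```

Each key of the resulting dictionary corresponds to one row of the table, with the set of IDs playing the part of the `emails` column.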

And that’s about it for the most basic index design.  Let’s go back to our search query.  It now becomes:

SELECT * FROM inverted_index WHERE token = 'fortran';

This query tells MySQL to scan one column of a sorted and uniquely indexed VARCHAR field.  A WHOLE lot faster than our original search query.  I’ve just run a comparison search on my database.  The corpus is a little over 250,000 rows, and consumes about 50MB on disk.  I ran a search for any items with both of two tokens.  I ran an “old-school” MySQL LIKE query looking for two different tokens – a full text scan.  The query took 0.7504 seconds, according to MySQL.  When I ran our more advanced search, a query for those same two tokens came back in just 0.0005 seconds.

Don’t be fooled though – that query speed and simplicity comes at a defined cost – the cost of set operations on the application side.  If you’re searching for emails which contain all of three different terms, you’ve got to find the intersection of all three sets returned by our new query on your own.  However – with very large corpuses, this overhead is quickly compensated for (and then some).
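The application-side set work might look something like this in Python.  It’s a sketch only: the tiny hard-coded index stands in for rows fetched from the inverted_index table, and search_all() is a name I’ve invented for the example.

```python
# Stand-in for rows fetched from the inverted index: token -> email IDs.
SAMPLE_INDEX = {
    "fortran": {1, 2},
    "have": {1, 2},
    "to": {1, 3},
    "written": {2},
}


def search_all(index, terms):
    """Return the IDs of emails containing every one of the given terms."""
    posting_sets = [index.get(term.lower(), set()) for term in terms]
    if not posting_sets:
        return set()
    # Starting from the smallest set keeps the intermediate results small.
    posting_sets.sort(key=len)
    return set.intersection(*posting_sets)


print(search_all(SAMPLE_INDEX, ["fortran", "written"]))  # {2}
print(search_all(SAMPLE_INDEX, ["fortran", "to"]))       # {1}
```

A term that isn’t in the index at all contributes an empty set, so the whole search correctly comes back empty.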

And there you have it – the diving platform of indexed text search.  Of course, even with this simple setup, there are dozens of optimizations one could make.  Maybe I’ll revisit this topic again sometime, with a more advanced look at some of the concepts presented.

Find and redirection

May 26th, 2009 Chris No comments

The find utility is one of Unix’s most powerful and flexible utilities.  But, as I’ve just spent a half hour learning, it can’t do everything.

I was busy on the command line of our primary web server here at the office, and needed a quick command to clear out log files.  Seemed simple enough.  I tried (variations of)

$> find . -type f -name "*.log" -exec cat /dev/null > {} \;

I’ll break it down, in case you’re unfamiliar with find.  find, at its most basic, begins at the specified target location, and recursively descends, printing all filesystem records it sees on its way downward (I tell it “current directory” with the “.”).  -type f specifies that only regular files are to be considered in the output, and no directories, symlinks, pipes, etc.  -name "*.log" tells find that I only want entries whose names match the shell pattern *.log.

find is a great utility if you stop there.  But the -exec command is the real gold.  -exec tells find not to print the entries it locates to stdout, but rather to execute the given command for each file found.  For example, if I did the following…

$> mkdir d
$> touch d/bar
$> touch d/foo

Then

$> find d -exec command -o {} \;

would be equivalent to the following

$> command -o d
$> command -o d/bar
$> command -o d/foo

Find runs command, with option o, on each viable entry found, replacing {} with the name of the file.

So, when I tried

$> find . -type f -name "*.log" -exec cat /dev/null > {} \;

I figured that for every file ending in .log, find would overwrite its contents with a null string.  Not so.  The result of the above command was that, in my current working directory, I had a file called “{}” and my logs were untouched.  I figured I had the solution in the bag, so to speak.

$> find . -type f -name "*.log" -exec cat /dev/null \> {} \;

Just escape the redirection character right?  Nope.  Try and try as I might, I couldn’t get it to work.  Finally, I gave up and implemented

$> find . -type f -name "*.log" -exec cp /dev/null {} \;

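Which has the same effect as desired, but I was still bothered.  Upon further research, it turns out that find just can’t redirect by itself: redirection is handled by the shell, which (in my first attempt) pointed find’s own output at a file named “{}” before find ever ran.  So if you need something like that, you’ll have to resort to invoking extra shells or interpreters, like so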

$> find . -type f -name "*.log" -exec sh -c "cat /dev/null > {}" \;

This works on my box and on our web server (one’s Debian, one’s Fedora, both using bash); however, from what I’ve been reading, this isn’t portable, so your mileage may vary.  (Specifically, the issue is that find isn’t guaranteed to replace {} when it is embedded inside a larger argument, such as a quoted string, rather than standing alone as its own argument.)
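If reaching for an interpreter anyway, the same job can be sketched in a few lines of Python, which sidesteps the quoting question entirely.  This is my own illustration rather than anything I ran on the server; truncate_logs() is a made-up helper, and like -type f it skips symlinks and directories.

```python
from pathlib import Path


def truncate_logs(root):
    """Empty every regular *.log file under root; return the paths touched."""
    truncated = []
    for path in Path(root).rglob("*.log"):
        # Mirror find's -type f: regular files only, no symlinks.
        if path.is_file() and not path.is_symlink():
            path.write_bytes(b"")  # opening for write truncates the file
            truncated.append(path)
    return truncated


# Example call (commented out so nothing is modified by accident):
# truncate_logs(".")
```

The files keep their names, owners, and permissions; only the contents go away.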

Silverlight

May 15th, 2009 Chris No comments

I grow weary of Microsoft’s “me too” attitude.

Shuffling through my RSS feeds, I saw a post on TechCrunch about the newest in Microsoft’s “Laptop Hunters” ad campaign, featuring Lauren and Sue.  I was intrigued (curious to know what angle they’re taking this time), and proceeded to peruse the article, and eventually click through to watch the video.

When I got to the actual video page (which was just a single click-through), I waited a moment for all of the page to load, and then was prompted to install Silverlight in order to watch their commercial.

No. I understand they want to be competitive, but that’s just bullying.

I’m not a fan of Microsoft.  Some of their software is just… poor (Media Player, Internet Explorer, Access, Front Page, etc).  There could be an entire Fail Whale comic book devoted to the various Windows operating systems (it’s been downhill since 3.1).  Their response to poor code is simply to shove the next version down everyone’s throat.  Now… I’m not worth listening to, however, if I don’t give credit where it’s due – the Microsoft Office suite (read: Word, PowerPoint, and Excel) is an absolutely phenomenal toolset.  Also, even as I cringe to admit it, I’m a big fan of Visual Studio 2008 (specifically with C#).

Now, we could argue that any software company that’s big enough to be considered “on the scene” can be accused of the above misdoings in varying measures and cases.  Apple, Google, Adobe, you name them.  So let’s just be fair off the bat.  My real beef with Microsoft is their historical, consistent, and predictable pattern of unfair play, strongarm tactics, and insatiable “me too” attitude.  They’re the big, bullying, pontificating, “one upper” of the playground, and it ruins everyone’s fun – especially since most of the time, they don’t really “one up” anyone… they just spin their wheels in mediocrity.

What was Microsoft’s focus on the search market before Google got strong?  Their attention to console games until PlayStation was a hit?  Did they care about browsers until Firefox started to lean in on their “turf”?  No.  They rested on their laurels, waiting for someone to make a move, and when it happened, they used their massive war chest and undeserved Windows market share to push their version of the product.

And their version, historically, usually just isn’t as good.  By far.  In any case, it certainly doesn’t offer me exceptional value over the competition.

Silverlight is the latest case of Microsoft’s “me too” compulsion.  They’re running web ads, for goodness sake.  In Silverlight.  How much market share does Silverlight have?  And Flash?  I’m sorry – you want me to install additional software to do what I can already do? Just because you had something to prove?  Why do you keep re-inventing the wheel?  If I were a stockholder, I’d be pretty pissed that Microsoft has continued to pour so much money into battling existing products for market space they already have, rather than putting that energy into coming up with something novel, so that instead of sinking money fighting for share in existing markets, they could be creating entirely new markets and reaping the benefits.

Back to my initial statement.  If you feel compelled to jump in the boat and join the party [late], at least do it right.  Offer me something that the other guy doesn’t.  What do Live Search, Live Search Maps, XBox, Silverlight, and the Zune have in common?  The fact that they were all “me too!’s” (Google, MapQuest, PlayStation, Flash, and iPod, respectively) that Microsoft jumped into without thought to really delivering a great product that gives consumers exceptional value – they just did it to keep up with the Joneses.

Note: I include XBox in my list of mediocre “me too’s” even though I enjoy playing it regularly.  However, it belongs there because the only reason I use the console is Halo.  Which is all credit to Bungie, not Microsoft.  Redmond was insanely fortunate to have such a killer app bundled right with the system.  I should like to have been the fly on the wall in an alternate universe where Halo didn’t exist, to see what people made of the XBox then.

Class Wrapper #2: Aspell

May 14th, 2009 Chris No comments

At work, I maintain a system in charge of aggregating hundreds of thousands of data records on a nightly basis. Some of this data comes from XML, some from CSV, and some from clients who ask us to crawl their site to retrieve their data. The system works really well (although aging badly).

There’s one problem though – the data, no matter in what form it hits our system – all comes from people. I don’t like people, because they make mistakes. They misspell. They omit some things, and add too much elsewhere. They don’t know how to normalize data sets.

“Quit complaining,” you mumble. “That’s your job!”

Quite right. For about a year now, the sum total of my issues with this system has been focused around reconciling poorly formed data. Most of the time, my data processing is due to a misspelling, or a mistyping, or something of the sort. And when I say “most of the time,” I mean most of the time. I’d wager easily 90%, or more. Here at the office full time again (since I’m back from school for the summer), I’ve been asked to shift my focus to the clarity and accuracy of our database almost exclusively (prior to this, I’ve done everything from system administration and network administration, to office tech support, to even some customer support).

Happy to oblige (no more constant distractions), I had time to sit back for a couple of hours and survey our technologies’ landscape. I quickly realized that spelling, typing, and translation errors constituted most of our woes.

Enter GNU Aspell, recommended to me about a year ago by a great guy, one of the creators of Birdstack. I was promptly distracted from looking into Aspell, and the idea hung motionless in my subconscious “to do” list until earlier this week.

Designed as a replacement for Ispell (an old Unix command for multilingual spell checking), Aspell’s major advantage over alternatives is that it does a great job at suggesting possible replacement words for misspellings. Add to that support for multiple custom dictionaries as well as personal (and session-based) dictionaries, and Aspell (with its C language binding) makes the perfect choice for domain-specific text correction.

The C binding is very easy to use (see example). You create a config object, you create a spellchecker object using the config object, and then you go to town, check()ing whether or not words are in the dictionary, and optionally returning linked lists of correction suggest()ions.

First, I attempted to home-grow a solution to the spelling situation. It didn’t seem like a large enough problem to warrant inclusion of a third party system. Turns out, it is. In my research, I discovered two promising algorithms for detecting misspellings and ranking suggestions. One was the old-fashioned Russell Soundex and its derivatives and cousins, such as Double Metaphone and NYSIIS. These algorithms attempt to capture the phonetic fingerprint of a word in a short sequence of characters. Thus, words that sound similar will have similar or matching soundex codes.
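For a feel of what a “phonetic fingerprint” looks like, here is a short Python sketch of the common American Soundex variant (first letter plus three digits).  It’s my own illustrative implementation, not code from Aspell or from my wrapper, and other Soundex variants differ in the details.

```python
# Consonant groups and the digit each group maps to in American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code for word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break up a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Two differently spelled names collapse to the same code, which is exactly the property that makes the scheme useful for catching misspellings.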

The other was the Levenshtein algorithm. If you’ve ever done a word ladder puzzle, you’re already familiar with the idea. Taking two strings, this clever algorithm compares them side by side, to calculate what’s called the “edit distance,” or, the number of operations (add a letter, delete a letter, change a letter from one thing to another) required to make the transformation between one string and the other. The lower the edit distance, the more similarly spelled the two words are.
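The edit distance is short enough to sketch in Python.  This is the textbook dynamic-programming formulation, keeping only one row of the table at a time; it’s here for illustration and isn’t taken from Aspell’s source.

```python
def levenshtein(a, b):
    """Return the number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))  # distances from '' to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Going from “kitten” to “sitting” takes two substitutions and one insertion, hence the distance of 3.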

Aspell is brilliant in that it precalculates the soundex of each word in its dictionary files, and then at runtime compares the edit distances of the soundexes to formulate the basis of its suggestion algorithm. Beautifully elegant. Here’s my wrapper. It’s still pretty early in development, so I’ll probably be updating it a few times over the next couple of weeks.

Header: speller.h

Implementation: speller.cpp

Test driver: speller_main.cpp

It’s as straightforward to use as I could make it. Create a Speller object, config()ure it to your preference, init()ialize it. The extra init() call, which I normally try to avoid, was necessary because of the way that the underlying AspellSpeller object is configured (with respect to the AspellConfig object) – unless I wanted to roll a two-class design, which seemed even less usable.

You’ll need to have installed the Aspell development package to use this, and then link with -laspell. Cheers!