Archive

Archive for May, 2009

Google Wave

May 29th, 2009 Chris No comments

In a previous post, I outlined the slow but sure initiative on behalf of Google to step properly into the realm social networking.

As it turns out, just the other day at their I/O conference, Google unveiled their big guns in the social space – Google Wave.

Google_Wave_logo

Instantly blurring the aging lines between Email and Instant Messaging, and at once meshing the capabilities of wildly popular platforms like AIM, Flickr, Facebook, and Moodle, Google Wave has me floored.  Messages are both real-time and persistent, and everything is versioned (which allows a newcomer any particular conversation to “replay” the entire sequence of modifications).  Adding photos (and I’d imagine some other types of files) from your local machine to a wave is as simple as drag-n-drop.  As conversations diverge and branch, and different people become active in different ways, Wave allows highly customizable, fine grained controls over each and every such branch.  Collaboration on wave-d text happens inline, in real time, and can be distributed (and all is “replay”able)

The service requires no special browser plug-ins or client applications – just an active account, and standards compliant browser (taking heavy advantage of the provisions in the HTML5 spec).  The Wave Federation Protocol is an Open Standard, and Google’s Wave Interface will be made Open Source (the API is already published – but still in flux).  There is a rich extension environment and plug-in environment on both client and server sides.  Google has already written several, including integration points with Google Maps, Twitter, and even – you might have to re read this once or twice – dynamic, real-time language translation.

Whatever Google says (or doesn’t say), I can just smell this interface being the preferred integration point among products like Gmail, Reader, Calendar, et. al. – even though there were no allusions to that being the case in the video.  I would very highly recommend – if you’ve got an hour and twenty minutes to spare or not – watching the full presentation to get the whole story (can be found on the product home page).  I’ll be waiting rather imaptiently for a chance to sign up.  Kudos to the Google Maps brothers Lars and Jens Rasmussen!

Information Retrieval

May 27th, 2009 Chris No comments

Information retrieval. It’s a high-brow word for search. I’ve actually begun breathing new life into an old search-related project in the past few days, and upon reading What 255 characters looks like, a particular section caught my eye.

It’s true that in most cases it won’t make a difference. However, if you need to index and search the field, you should think carefully before blindly using TEXT. The data in TEXT type fields are stored outside the table itself, using only a few bytes for pointer information. This means that TEXT fields are not indexed, while VARCHAR fields are. This can have a tremendous effect on your SQL query speeds, as generally larger TEXT fields increase query time exponentially. Even if we take indexing out of the picture, the external storage of TEXT fields means that you’ll still see generally faster searches with VARCHAR.

Frank here presents nothing but a rational, fact based explanation of the biggest [performance] differences between VARCHAR and TEXT fields in MySQL.  But it begs a rather distinct question, in my mind.  What do you do, if you have too much data for a VARCHAR field, but require fast searching nonetheless?  My first reaction, is not to solve a problem that’s already been solved.  There are third party applications, such as Sphinx, that do a wonderful job of indexing content.  After all, Jeff Atwood makes this point very clearly in his perfectly-named post Don’t Reinvent The Wheel, Unless You Plan on Learning More About Wheels.

Done and done.  Problem solved.  But… what if you really do want to just learn more about wheels information retrieval?  Firstly, I would highly recommend reading both Frank and Jeff’s posts.  You should, then, have a rough idea of where I’m coming from to NOT use a Sphinx-esque product for my new venture.  That all said, let’s get started.  We’ll piggyback on Frank’s use of email searching throughout our examples.

We’ll create a table to store emails, and keep it simple, for the sake of example.  I’m using MySQL 5, and I like UTF-8.

CREATE TABLE `foobar`.`emails` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`subject` VARCHAR( 255 ) NOT NULL ,
`body` TEXT NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci

Here we have a primary key field, a subject (VARCHAR) and a body (TEXT).

Note: In Information Retrieval (“IR”) this is called the forward index.  If you know the ID of an email, you can get all the tokens in that email.

Let’s assume you wanted to search all emails for the word “fortran”.  In a plain vanilla world, your query might look something like

SELECT * FROM emails WHERE body LIKE '%fortran%' OR subject LIKE '%fortran%';

This means, that MySQL must fetch, load, and scan nearly every character of every email in the table.  If you’ve got three messages, no big deal.  My Gmail inbox, however, at the moment is around 8000 conversations – about 570MB.  That would take a while.  The alternative, is to put effort into doing all this scanning ahead of time.  And that’s the essence of the term “indexing”.

We’ll create a script (or program, or stored procedure, etc) that goes through, and finds all the words ahead of time for us.  And then we use a level of indirection at search-time, to get our information.  Take the following three emails:

Email #1

Subject: programming class

Body: I have to write a fortran program for my computer science class.

Email #2

Subject: programming class

Body: Have you ever written fortran before?

Email #3

Subject: spam

Body: I will try to trick you into clicking these links.

Our script will tokenize each text field of interest, compile a list of all unique tokens. and store them in a database table.  A table, for starters, something like this

CREATE TABLE `foobar`.`inverted_index` (
`token` VARCHAR( 255 ) NOT NULL ,
`emails` TEXT NOT NULL ,
UNIQUE (
`token`
)
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci

For our three sample emails, the unique tokens are the set: { a, before, class, clicking, computer, ever, for, fortran, have, i, into, links, my, program, programming, science, spam, these, to, trick, try, will, write, written, you }

So, we insert these tokens, one by one.  We then update the table – for each token, we find the set of emails in which that token appears, and store their IDs.  For example, the row for token “fortran” would look like: “fortran” => {1,2}.  Another row might be “to” => {1,3}, or “written” => {2}.  We store all of these in the database table called “inverted_index”.  It’s called the inverted index because, conversely from the forward index, given a token, we know all emails containing that token.

And that’s about it for the most basic index design.  Let’s go back to our search query.  It now becomes:

SELECT * FROM inverted_index WHERE token = ‘fortran’;

This query tells MySQL to scan one column of a sorted and uniquely indexed VARCHAR field.  A WHOLE lot faster than our original search query.  I’ve just run a comparison search on my database.  The corpus is a little over 250,000 rows, and consumes about 50MB on disk.  I ran a search for any items with both of two tokens.  I ran an “old-school” MySQL LIKE query looking for two different tokens – a full text scan.  The query took 0.7504 seconds, according to MySQL.  When I ran our more advanced search, a query for those same two tokens came back in just 0.0005 seconds.

Don’t be disillusioned though – that query speed and simplicity comes at a defined cost – the cost of set operations on the application side.  If you’re searching for emails which contain all of three different terms, you’ve got to find the intersection of all three sets returned by our new query on your own.  However – with very large corpuses, this overhead is quickly compensated for (and then some).

And there you have it – the diving platform of indexed text search.  Of course, even with this simple setup, there are dozens of optimizations one could make.  Maybe I’ll revisit this topic again sometime, with a more advanced look at some of the concepts presented.

Find and redirection

May 26th, 2009 Chris No comments

The find utility is one of Unix’s most powerful and flexible utilities.  But, as I’ve just spent a half hour learning, it can’t do everything.

I was busy on the command line of our primary web server here at the office, and needed a quick command to clear out log files.  Seemed simple enough.  I tried (variations of)

$> find . -type f -name "*.log" -exec cat /dev/null > {} \;

I’ll break it down, in case you’re unfamiliar with findfind, at it’s most basic, begins at the specified target location, and recursively descends, printing all filesystem records it sees on it’s way downward (I tell it “current directory” with the “.”).  -type f specifies that only regular files are to be considered in the output, and no directories, symlinks, pipes, etc..  -name "*.log" tells find that I only want entries matching the shell expansion *.log.

find is a great utility if you stop there.  But the -exec command is the real gold.  -exec tells find not to print the entires it locates to stdout, but rather to execute the given command for each file found.  For example, if I did the following…

$> mkdir d
$> touch d/bar
$> touch d/foo

Then

$> find d -exec command -o {} \;

would be equivalent to the following

$> command -o d
$> command -o d/bar
$> command -o d/foo

Find runs command, with option o, on each viable entry found, replacing {} with the name of the file.

So, when I tried

$> find . -type f -name "*.log" -exec cat /dev/null > {} \;

I figured that for every file ending in .log, find would overwrite it’s contents with a null string.  Not so.  The result of the above command, was that, in my current working directory, I had a file called “{}” and my logs were untouched.  I figured I had the solution in the bag, so to speak.

$> find . -type f -name "*.log" -exec cat /dev/null \> {} \;

Just escape the redirection character right?  Nope.  Try and try as I might, I couldn’t get it to work.  Finally, I gave up and implemented

$> find . -type f -name "*.log" -exec cp /dev/null {} \;

Which is the same effect as desired. but I was still bothered.  Upon further research, it turns out that find just can’t redirect by itself, so if you need something like that, you’ll have to resort to invoking extra shells or interpreters, like so

$> find . -type f -name "*.log" -exec sh -c "cat /dev/null > {}" \;

This works on my box and on our web server (One’s Debian, one’s Fedora, both using bash) however from what I’ve been reading, this isn’t portable, so your mileage may vary.  (Specifically, the issue lies around find [not] being able to properly replace the value of {} when nested inside of quotes and things)

Silverlight

May 15th, 2009 Chris No comments

I grow weary of Microsoft’s “me too” attitude.

Shuffling through my RSS feeds, I saw a post on TechCrunch about the newest in Microsoft’s “Laptop Hunters” ad campaign, featuring Lauren and Sue.  I was intrigued (curious to know what angle they’re taking this time), and proceeded to peruse the article, and eventually click through to watch the video.

When I got to the actual video page (which was just a single click-through), I waited a moment for all of the page to load, and then was prompted to install Silverlight in order to watch their commercial.

No. I understand they want to be competitive, but that’s just bullying.

bully

I’m not a fan of Microsoft.  Some of their software is just… poor (Media Player, Internet Explorer, Access, Front Page, etc).  There could be an entire Fail Whale comic book devoted to the various Windows operating systems (it’s been downhill since 3.1).  Their response to poor code is simply to shove the next version down everyone’s throat.  Now… I’m not worth to be listened to, however, if I don’t give credit where it’s due – the Microsoft Office suite (read: Word, PowerPoint, and Excel) is an absolutely phenomenal toolset.  Also, even as I cringe to admit it, I’m a big fan of Visual Studio 2008 (specifically with C#).

Now, we could argue that any software company that’s big enough to be considered “on the scene” can be accused of the above misdoings in varying measures and cases.  Apple, Google, Adobe, you name them.  So let’s just be fair off the bat.  My real beef with Microsoft, is their historical, consistent, and predictable pattern of unfair play, strongarm tactics, and unsatiable “me too” attitude.  They’re the big, bullying, pontificating, “one upper” of the playground, and it ruins everone’s fun – especially since most of the time, they don’t really “one up” anyone… they just spin their wheels in mediocrity.

What was Microsoft’s focus on the search market before Google got strong?  Their attention to console games until PlayStation was a hit?  Did they care about browsers until Firefox started to lean in on their “turf”?  No.  They sat back on their laurels, waiting for someone to make a move, and when it happened, they used their massive war chest and undeserved Windows market share to push their version of the product.

And their version, historically, usually just isn’t as good.  By far.  In any case, it certainly doesn’t offer me exceptional value over the competition.

Silverlight is the latest case of Microsoft’s “me too” compulsion.  They’re running web ads, for goodness sake.  In Silverlight.  How much market share does Silverlight have?  And Flash?  I’m sorry – you want me to install additional software to do what I can already do? Just because you had something to prove?  Why do you keep re-inventing the wheel?  If I were a stockholder, I’d be pretty pissed that Microsoft has continued to pour so much money into battling existing products for market space they already have, rather than putting that energy into coming up with something novel, so that instead of sinking money fighting for share in existing markets, they could be creating entirely new markets and reaping the benefits.

Back to my initial statement.  If you feel compelled to jump in the boat and join the party [late], at least do it right.  Offer me something that the other guy doesn’t.  What do Live Search, Live Search Maps, XBox, Silverlight, and the Zune have in common?  The fact that they were all “me too!’s” (Google, MapQuest, PlayStation, Flash, and iPod, respectively) that Microsoft jumped into without thought to really delivering a great product that gives consumers exceptional value – they just did it to keep up with the Joneses.

Note: I include XBox in my list of mediocre “me too’s” even though I enjoy playing it regularly.  However, it belongs there because the only reason I use the console is because of Halo.  Which, is all credit to Bungie, not Microsoft.  Redmond was insanely fortunate to have such a killer app bundled right with the system.  I should liked to have been the fly on the wall in an alternate universe where Halo didn’t exist, to see what people made of the XBox then.

Class Wrapper #2: Aspell

May 14th, 2009 Chris No comments

At work, I maintain a system in charge of aggregating hundreds of thousands of data records on a nightly basis. Some of this data comes from XML, some from CSV, and some from clients who ask us to crawl their site to retrieve their data. The system works really well (although aging badly).

There’s one problem though – the data, no matter in what form it hits our system – all comes from people. I don’t like people, because they make mistakes. They misspell. They omit some things, and add too much elsewhere. They don’t know how to normalize data sets.

“Quit complaining,” you mumble. “That’s your job!”

Quite right. For about a year now, the sum total of my issues with this system have been focused around reconciling poorly formed data. Most of the time, my data processing is due to a misspelling, or a mistyping, or something of the sort. And when I say “most of the time,” I mean most of the time. I’d wager easily 90%, or more. Here at the office full time again (since I’m back from school for the summer) I’ve been asked to shift my focus to the clarity and accuracy of our database almost exclusively (prior to this, I’ve done everything from system administration and network administration, to office tech support, to even some customer support).

Happy to oblige (no more constant distractions), I had time to sit back for a couple of hours and survey our technologies’ landscape. I quickly realized that spelling, typing, and translation errors constituted most of our woes.

Enter GNU Aspell, recommended to me about a year ago by a great guy, one of the creators of Birdstack. I was promptly distracted from looking into Aspell, and the idea hung motionless in my subconscious “to do” list until earlier this week.

Designed as a replacement for Ispell (an old Unix command for multilingual spell checking), Aspell’s major advantage over alternatives is that it does a great job at suggesting possible replacement words for misspellings. Add to that support for multiple custom dictionaries and also personal (and also session-based) dictionaries, and Aspell (with it’s C language binding) makes the perfect choice for domain-specific text correction.

The C binding is very easy to use (see example). You create a config object, you create a spellchecker object using the config object, and then you go to town, check()ing whether or not words are in the dictionary, and optionally returning linked lists of correction suggest()ions.

First, I attempted to home-grow a solution to the spelling situation. It didn’t seem like a large enough problem to warrant inclusion of a third party system. Turns out, it is. In my research, I discovered two promising algorithms of detecting misspellings and ranking suggestions. One, was the old fashioned Russel Soundex and it’s derivatives and cousins, such as the Double Metaphone, and the NYSIIS. These algorithms attempt to capture the phonetic fingerprint of a word in a short sequence of characters. Thus, words that sound similar, will have similar or matching soundex codes.

The other was the Levenshtein algorithm. If you’ve ever done a word ladder puzzle, you’re already familiar with the idea. Taking two strings, this clever algorithm compares them side by side, to calculate what’s called the “edit distance,” or, the number of operations (add a letter, delete a letter, change a letter from one thing to another) required to make the transformation between one string and the other. The lower the edit distance, the more similarly spelled the two words are.

Aspell is brilliant in that it precalculates the soundex of each word in it’s dictionary files, and then at runtime compares the edit distances of the soundexes to formulate the basis of it’s suggestion algorithm. Beautifully elegant. Here’s my wrapper. It’s still pretty early in development, so I’ll probably be updating it a few times over the next couple of weeks.

Header: speller.h

Implementation: speller.cpp

Test driver: speller_main.cpp

It’s as straightforward to use as I could make it. Create a Speller object, config()ure it to your preference, init()ialize it. The extra init() call, which I normally try to avoid, was necessary because of the way that the underlying AspellSpeller object is configured (with respect to the AspellConfig object) – unless I wanted to roll a 2 class design, which seemed even less usable.

You’ll need to have installed the Aspell development package to use this, and then link with -laspell. Cheers!

Fwd:Vault

May 13th, 2009 Chris 1 comment

FwdVault

I had the fortune of getting a private beta invite (to sound official and TechCrunch-y about it) to a really neat startup this morning called Fwd:Vault.

The new service allows the user to create secure file backups via email, and email alone.  Ordinary backup programs require some sort of client/server software, or at least a special program running in the background that needs to be configured.  Not so with Fwd:Vault.  If you want a secure remote copy of a document, a PowerPoint presentation, a photograph, or indeed anything else you could attach to an email and send to someone, all that’s required is for you to send the email(s) [and attachment(s)] to an email address provided by Fwd:Vault.  To recover a document stored in this fashion, you send a request email, and receive the file in an automatically generated reply message.

The system is actually really neat – and it’s fast too.  I sent a picture taken of my girlfriend and I over Easter to the system, and instantly got a reply email with a confirmation of the backup, as well as a code to retrieve the file at a later date.  I went ahead and requested the file back again using the code, and the Fwd:Vault system promptly replied via email with my photo attached.

There is a web interface (currently in development, so I can’t try it yet) that will allow you to view your archived files.  This is going to be important, because if everything is done over email, and your email goes down, or you delete the confirmation emails with the codes in them, then you wouldn’t otherwise be able to retrieve your files… which will prove a pretty big problem.  But from what I’m able to gather, the web interface for archived files sounds like it’s going to be really spot-on.  Fwd:Vault is also said to handle file versioning, but I’ve not yet tested that.

Overall, from the portions of the site that are active at the moment, this is a really neat idea and looks to be developing pretty solidly.  I don’t know what the cap on attachment size is, but I’m definitely going to be using this when it goes live.

Disclosure: Fwd:Vault founder Frank Koehl is a great friend and mentor of mine.  I let him know that I’d be a willing beta tester if ever he wanted one, and he took me up on the offer.  He never asked me to review Fwd:Vault here and I do so only because I think it’s a cool idea.  The content of this review is entirely my own, and entirely subjective.

Perception

May 12th, 2009 Chris No comments

Object oriented programming is a hoax. A good one, I’ll admit; but it’s nothing more than some glorified hand waving and a leap of faith. Say we have an imaginary C library called the Nifty API. Following the C tradition, typical API usage might look something like this…

Listing A:

NIFTY* handle = nifty_init();
nifty_action( handle, 37 );
nifty_other_action( handle, "testing", fp );
nifty_destroy( handle );

Now, it’s easy to guess what the compiler might be doing when it gets to Listing A. There is some library call or resource allocation happening in the nifty_init() call, which returns the address of a custom data structure (a NIFTY pointer). The address of the NIFTY pointer then gets carted around by the application developer and passed into every other Nifty library call. Then, we tell the Nifty library to deallocate those session resources by a call to nifty_destroy() (to which we also pass the NIFTY resource handle).

If the Nifty developers had built instead a native C++ (read: object oriented) API, it’s not too difficult to imagine what such a beast might look like.

Listing B:

Nifty* handle = new Nifty();
handle->action( 37 );
handle->otherAction( "testing", fp );
delete handle;

Listing B is nearly the exact same thing as listing A, and here’s why: Object oriented programming is just a scope trick. Essentially, classes are pseudo-namespaces, in which you can define data and functions (as in any namespace), and the compiler just does some extra legwork for you. Conceptually*, what the compiler does, in Listing B, is a “translation” to the following.

Listing C:

Nifty* handle = Nifty::Nifty( (Nifty*)malloc( sizeof(Nifty) ) );
Nifty::action( handle, 37 );
Nifty::otherAction( handle, "testing", fp );
Nifty::~Nifty( handle );

The call to the Nifty constructor in B.1 takes no arguments, but C.1 passes the result of a malloc() into it. Over lines C.2 and C.3, the reference to the handle gets moved from object syntax into the first parameter slot, and call the methods with the scope resolution operator. C.4 directly calls the Nifty destructor – again, with an extra parameter! What’s going on here? Some of you will already have guessed it. The legendary this pointer. When you instantiate a new class object, the compiler isn’t wasting it’s time copying the code for the class methods over and over – it keeps a global copy of that. Instead, it uses a hidden pointer that gets passed to every nonstatic method call. The this pointer. That’s how the compiler distinguishes between handle->action( 37 ) and other_handle->action( 37 ).

Static members and methods are easy. Let’s say you define a class with ONLY static members and methods, like so…

Listing D:

class Bar {
    public:
        static int a;
        static char b;
        static void fun( void );
};

There is no difference between this code, and if we were to convert this class into a namespace like this…

Listing E:

namespace Bar {
    int a;
    char b;
    void fun( void );
};

Ignoring the using keyword, there is absolutely no difference, syntactically or semantically, between the usage of Listings D and E. None at all. Think about it.

The only differences between classes and static namespaces are with respect to nonstatic methods and members. For nonstatic members and methods, the compiler will basically generate a struct containing the members (the address of which gets used for the this pointer), and for all method calls, performs the aforementioned “translation.” Voila! Procedurally, we’ve just reconstructed the object oriented paradigm. And it was just some hand waving on part of the compiler. A glorified mind trick, and nothing more.

However, this “glorified mind trick” is not to be underestimated. It has caused an industry-wide paradigm shift, and has itself paved the way for ever more abstract programming techniques. And this is the whole point of my example (besides, perhaps, to broaden your view of the object oriented pattern): to reinforce the age old “perception is reality” mantra. When I write something using the object oriented paradigm, I don’t worry that all that craziness is going on behind the scenes. I honestly don’t really care that it’s being turned into procedural code for me. I think in terms of objects, and I code in terms of objects, and it makes life easier all the way around. And that’s the point – while I’m coding objects, I’m perceiving objects, and while I perceive them, they are the reality of my code.

If your website is down, users don’t care that it’s actually the database that’s down, and not really the website. They won’t care whether it’s the database, the nameservers, or little green men swinging from the ladder racks in your network closet. If they can’t use it, it’s broken. And it doesn’t matter if the blame lies with you, them, or somebody in between.

*I am by no means a compiler developer, nor do have I studied compiler design theory in any depth yet. My claims in this post are merely arguments formed from years’ experience battling with (and against) GCC. My code (specifically regarding the “refactoring” that occurs between Listings B and C) is merely illustrational. However, at the end of the day, everything is machine code, twiddling bits on individual registers. So, whether at a high level or low, our wonderful objects do get deconstructed at some point. This is simply my theory as to how this gets done in C++ with the GNU Compiler Collection.

Git and Subversion

May 6th, 2009 Chris No comments

Back in August, I strongly recommended cried out for the average software developer to start using source control software. Today, as I opened up a fresh new project directory, I introspected before I called upon RapidSVN (which an Open Source GUI frontend for Subversion designed for Linux desktops). Why do I use Subversion?

Only because it was recommened to me.

Granted, that recommendation came from a great friend and mentor, and Subversion is really a wondeful client/server application. But… I thought… why keep the status quo for the sake of it? I’d read some press about Git lately, so I decided to give it a fair chance at winning me over. A test of usability.

It passed. With flying colors.

Cunningly enough, Git has a very well thought out introduction coming from the perspective of an SVN background. So useful, in fact, that my new project is now running with Git instead of SVN. And, when I create novel projects in the future? Git.

Subversion is nice. Reliable, which is what you need in source control software. But Git is reliable, easy to use, fast, scalable, and efficient.

I’ve administered Subversion services before. I can now say I’ve administered Git services. Git wins, hands down. From the website: “Following the Unix tradition, Git is a collection of many small tools written in C, and a number of scripts that provide convenient wrappers.” Git is remarkably easy to use. For addition, removal, and movement of files, commits, rollbacks, tagging and branching (which are done quite differently than in Subversion) – all are nearly an order of magnitude easier and more intuitive than Subversion.*

If you’ve come from Subversion, and been pulled into Git, how did it happen, and what was your experience? Have you left Git for Subversion? Why? Maybe you’re using Subversion now – try Git, and share your experiences!

*May the reader note that Subversion was my first experience with source control software, so I have to credit the initial source control learning curve as for part of my complaint. Also, there are a lot of similar processes between Subversion and Git (and all source control), so a lot of my prior knoweldge translated from one to the other.