Information Retrieval
Information retrieval. It’s a high-brow word for search. I’ve actually begun breathing new life into an old search-related project in the past few days, and upon reading What 255 characters looks like, a particular section caught my eye.
It’s true that in most cases it won’t make a difference. However, if you need to index and search the field, you should think carefully before blindly using TEXT. The data in TEXT type fields are stored outside the table itself, using only a few bytes for pointer information. This means that TEXT fields are not indexed, while VARCHAR fields are. This can have a tremendous effect on your SQL query speeds, as generally larger TEXT fields increase query time exponentially. Even if we take indexing out of the picture, the external storage of TEXT fields means that you’ll still see generally faster searches with VARCHAR.
Frank here presents nothing but a rational, fact based explanation of the biggest [performance] differences between VARCHAR and TEXT fields in MySQL. But it begs a rather distinct question, in my mind. What do you do, if you have too much data for a VARCHAR field, but require fast searching nonetheless? My first reaction, is not to solve a problem that’s already been solved. There are third party applications, such as Sphinx, that do a wonderful job of indexing content. After all, Jeff Atwood makes this point very clearly in his perfectly-named post Don’t Reinvent The Wheel, Unless You Plan on Learning More About Wheels.
Done and done. Problem solved. But… what if you really do want to just learn more about wheels information retrieval? Firstly, I would highly recommend reading both Frank and Jeff’s posts. You should, then, have a rough idea of where I’m coming from to NOT use a Sphinx-esque product for my new venture. That all said, let’s get started. We’ll piggyback on Frank’s use of email searching throughout our examples.
We’ll create a table to store emails, and keep it simple, for the sake of example. I’m using MySQL 5, and I like UTF-8.
CREATE TABLE `foobar`.`emails` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`subject` VARCHAR( 255 ) NOT NULL ,
`body` TEXT NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci
Here we have a primary key field, a subject (VARCHAR) and a body (TEXT).
Note: In Information Retrieval (“IR”) this is called the forward index. If you know the ID of an email, you can get all the tokens in that email.
Let’s assume you wanted to search all emails for the word “fortran”. In a plain vanilla world, your query might look something like
SELECT * FROM emails WHERE body LIKE '%fortran%' OR subject LIKE '%fortran%';
This means, that MySQL must fetch, load, and scan nearly every character of every email in the table. If you’ve got three messages, no big deal. My Gmail inbox, however, at the moment is around 8000 conversations – about 570MB. That would take a while. The alternative, is to put effort into doing all this scanning ahead of time. And that’s the essence of the term “indexing”.
We’ll create a script (or program, or stored procedure, etc) that goes through, and finds all the words ahead of time for us. And then we use a level of indirection at search-time, to get our information. Take the following three emails:
Email #1
Subject: programming class
Body: I have to write a fortran program for my computer science class.
Email #2
Subject: programming class
Body: Have you ever written fortran before?
Email #3
Subject: spam
Body: I will try to trick you into clicking these links.
Our script will tokenize each text field of interest, compile a list of all unique tokens. and store them in a database table. A table, for starters, something like this
CREATE TABLE `foobar`.`inverted_index` (
`token` VARCHAR( 255 ) NOT NULL ,
`emails` TEXT NOT NULL ,
UNIQUE (
`token`
)
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci
For our three sample emails, the unique tokens are the set: { a, before, class, clicking, computer, ever, for, fortran, have, i, into, links, my, program, programming, science, spam, these, to, trick, try, will, write, written, you }
So, we insert these tokens, one by one. We then update the table – for each token, we find the set of emails in which that token appears, and store their IDs. For example, the row for token “fortran” would look like: “fortran” => {1,2}. Another row might be “to” => {1,3}, or “written” => {2}. We store all of these in the database table called “inverted_index”. It’s called the inverted index because, conversely from the forward index, given a token, we know all emails containing that token.
And that’s about it for the most basic index design. Let’s go back to our search query. It now becomes:
SELECT * FROM inverted_index WHERE token = ‘fortran’;
This query tells MySQL to scan one column of a sorted and uniquely indexed VARCHAR field. A WHOLE lot faster than our original search query. I’ve just run a comparison search on my database. The corpus is a little over 250,000 rows, and consumes about 50MB on disk. I ran a search for any items with both of two tokens. I ran an “old-school” MySQL LIKE query looking for two different tokens – a full text scan. The query took 0.7504 seconds, according to MySQL. When I ran our more advanced search, a query for those same two tokens came back in just 0.0005 seconds.
Don’t be disillusioned though – that query speed and simplicity comes at a defined cost – the cost of set operations on the application side. If you’re searching for emails which contain all of three different terms, you’ve got to find the intersection of all three sets returned by our new query on your own. However – with very large corpuses, this overhead is quickly compensated for (and then some).
And there you have it – the diving platform of indexed text search. Of course, even with this simple setup, there are dozens of optimizations one could make. Maybe I’ll revisit this topic again sometime, with a more advanced look at some of the concepts presented.