Monday, May 24, 2010

Misspellings In Search

Ever heard of a word ladder? It's a game invented by the Lewis Carroll, the author of Alice in Wonderland. Here's how it works: you're given two words, and you start with one of them. Change one letter in the word at a time, or add or remove a letter, to arrive at the second word. At each step along the way, the changes you make have to result in valid words.

Here's an example: how do you turn "LEAD" into "GOLD"? Try it yourself right now if you're interested in solving the puzzle.

Here's the solution I devised back when I tackled this problem in middle school, changing one letter at a time:

LEAD
HEAD
HELD
HOLD
GOLD

The key here is that I changed the beginning "L" to an "H", since "LELD" and "LOLD" are not real words. It's an extra step that's needed to conform to the rules of the game. But as you can see, it takes four "steps" on the word ladder in order to get from the starting word to the ending one.

Counting the number of steps from the first word to the second can be a useful number to calculate when you want to offer spelling corrections to users who are doing internal searches on your site. You know how Google offers you polite suggestions when you misspell a word? (e.g. "Did you mean: 'electric guitar'")

The number of character insertions, deletions, and substitutions required to convert one string to another is known as the Levenshtein distance. Unlike word ladders, the Levenshtein distance doesn't care if each step along the way results in an actual word. So, the steps between "stretch" and "straight" would be as follows:

stretch
stretcht
stratcht
straicht
straight

The Levenshtein distance between the two string is 4. The closer two words are together, the lower the distance score between the two of them will be.

What does this mean for misspelled search terms? If someone enters some search text on your site, and there are zero matching results returned, you might consider sweeping through your database and looking for strings or phrases with a very low Levenshtein distance, in an effort to catch misspelled words.

If someone enters "accoustic guitar", they might get no matching results. But, if your product data contains the phrase "acoustic guitar", you might be able to catch that and display a hyperlink offering the alternative "Did you mean...?" text that will repeat the search with the new word if clicked.

So how do you calculate the Levenshtein distance in Python? There are a few ready-to-use Python functions listed on this Levenshtein Distance wiki page. I've been using the first one listed in a couple of my projects and it seems to work very well.

Naturally, this is not a substitution for a complete full-text search package like Django Sphinx or your actual search algorithm; it's just something you might consider falling back on. Also, you want to be very careful about how you implement this on a live site where you're searching through thousands of records. Computing the Levenshtein distance between some search text and the words and phrases in a TextField (like, for example, in a product description field), would be a very slow, inefficient way to search for possible spelling corrections. If you do implement something, make sure you test its performance before releasing it into the wild.

However, if you have a small site in development, and want to try your hand at helping guide the user if they've possibly misspelled a search term, I found the Levenshtein distance is a great quick-and-dirty way to get started.

No comments:

Post a Comment