HOME

How do Spellcheckers work?

(Levenshtein Distance)


Anyone who has used DTP (DeskTop Publishing) software knows about the immensely useful and sometimes intensely annoying piece of software that puts highly irritating red squiggles under the words as we type. Yes, we are talking about Spellcheckers. They are great because not only do they point out the misspelt word as we type but also offer a choice of possibly correct words. More often than not, the first or second choice is the one that we intended to type. So, how do they work?

At the heart of spellcheckers is a famous algorithm that computes a similarity score between words. Given any two words or even just sequences of characters, it compares them and computes the distance between them. The algorithm finds the minimum number of edit operations needed to transform the first word into the second. The number of edit operations is called the Levenshtein Edit Distance and the algorithm is the Levenshtein Edit Distance algorithm. The edit operations are: insert, delete and substitute. For example, to transform the word train into travel, we can (a)substitute i with v; substitute n with e; and then insert the letter l. Another way is to (a) delete the letters i and n; and insert the letters v, e and l. In fact, three operations is the minimum required to transform train into travel and therefore the Levenshtein Edit Distance is 3.


How is Levenshtein Edit Distance computed?

The standard way of computing Levenshtein Edit Distance uses a grid/matrix of numbers that are filled from top-left to bottom-right. The size of the matrix is m rows and n columns where m is the number of letters in the second word and n is the number of letters in the first word. The edit distance between the two words is given by the number in the last column of the last row.

As an example, let us compute the edit distance between train and travel. The matrix we need has 5 columns and 6 rows as shown below. An additional row and column are added (indicated with the bluish grey background) to help set up the calculations.

tra in
0 12 34 5
t1 01 234
r21 0123
a321 012
v43 2112
e54 322 2
l65 433 3

We start filling the matrix by adding one dummy row and dummy column (the ones in bluish grey). The row indicates how we can transform a "blank" into the word "train". The 1 under the letter "t" indicates that we need one edit operation to make a "blank" into "t"; and it is done by inserting the new letter "t". Exactly with the same reasoning, the edit distances (i.e., the number of edit operations) between "tr" and blank is 2, between "tra" and blank is 3 and so on.

The dummy column is also filled in a similar way: how many operations does it take to convert a "blank" into "t", "tr", "tra", "trav" and until "travel". The edit distances are given in the column.

Now, we are ready for the real deal! The first row of the matrix is the one containing "t" of "travel" and the first column is the one under "t" of "train". In this matrix, when we look at any number in row i and column j, it gives the number of edit operations required to transform the first j letters of the first into the first i letters of the second word. For example, the 0 in the third row and the third column (given in green) gives the number of edit operations required to transform the first three letters of "train", i.e., "tra" into the first three letters of "travel", i.e., "tra". As these two are identical, the number of edit operations is 0.

It is now easy to fill the matrix. For every cell in row i and column j, look at its neighbours to the top, to the left and to the top left. Find the minimum value among these three cells. If the $j^{th}$ letter of the first word is equal to the $i^{th}$ letter of the second word, fill it with the minimum value. If they are not same, add 1 to the minimum value and fill it in the cell. Proceed this way until the bottom-right cell is filled. The value in the bottom right cell is the distance between the two words. Work it out by yourself to see if you followed the explanation.

The distance between "train" and "travel" is 3 (the orange coloured last cell). What are the three operations you need?

As another example, look at the 5th row and 4th column (again given in green). It should give the edit distance betweem the first 4 letters of "train", i.e., "trai" and the first 5 letters of "travel", i.e., "trave". The distance is 2 and we see that we do need two operations - one to substitute "v" for "i" and another to insert "e". Also, note that the minimum of the three values (top: 1, left: 2, and top-left: 1) is 1. As the two letters "i" and "e" are not the same, the value to fill is 1 plus the minimum which is equal to 2.

V. I. Levenshtein


Vladimir Iosifovich Levenshtein is known as the "Father of Coding Theory" in Russia. He is a member of the Russian Academy of Sciences and an IEEE Fellow.