Anyone who has used DTP (DeskTop Publishing) software knows about the immensely useful and sometimes intensely annoying piece of software that puts highly irritating red squiggles under the words as we type. Yes, we are talking about Spellcheckers. They are great because not only do they point out the misspelt word as we type but also offer a choice of possibly correct words. More often than not, the first or second choice is the one that we intended to type. So, how do they work?
At the heart of spellcheckers is a famous algorithm that
computes a similarity score between words. Given any two words
or even just sequences of characters, it compares them and
computes the distance between them. The
algorithm
The standard way of computing Levenshtein Edit Distance uses a grid/matrix of numbers that are filled from top-left to bottom-right. The size of the matrix is m rows and n columns where m is the number of letters in the second word and n is the number of letters in the first word. The edit distance between the two words is given by the number in the last column of the last row.
As an example, let us compute the edit distance
between train and travel. The matrix we need
has 5 columns and 6 rows as shown below. An additional row and
column are added (indicated with
the
t | r | a | i | n | ||
---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | |
t | 1 | 0 | 1 | 2 | 3 | 4 |
r | 2 | 1 | 0 | 1 | 2 | 3 |
a | 3 | 2 | 1 | 0 | 1 | 2 |
v | 4 | 3 | 2 | 1 | 1 | 2 |
e | 5 | 4 | 3 | 2 | 2 | 2 |
l | 6 | 5 | 4 | 3 | 3 | 3 |
We start filling the matrix by adding one dummy row and dummy column (the ones in bluish grey). The row indicates how we can transform a "blank" into the word "train". The 1 under the letter "t" indicates that we need one edit operation to make a "blank" into "t"; and it is done by inserting the new letter "t". Exactly with the same reasoning, the edit distances (i.e., the number of edit operations) between "tr" and blank is 2, between "tra" and blank is 3 and so on.
The dummy column is also filled in a similar way: how many operations does it take to convert a "blank" into "t", "tr", "tra", "trav" and until "travel". The edit distances are given in the column.
Now, we are ready for the real deal! The first row of the
matrix is the one containing "t" of "travel" and the first
column is the one under "t" of "train". In this matrix, when
we look at any number in row i and column j,
it gives the number of edit operations required to transform
the first j letters of the first into the
first i letters of the second word. For example, the
0 in the third row and the third column (given
in
It is now easy to fill the matrix. For every cell in row i and column j, look at its neighbours to the top, to the left and to the top left. Find the minimum value among these three cells. If the $j^{th}$ letter of the first word is equal to the $i^{th}$ letter of the second word, fill it with the minimum value. If they are not same, add 1 to the minimum value and fill it in the cell. Proceed this way until the bottom-right cell is filled. The value in the bottom right cell is the distance between the two words. Work it out by yourself to see if you followed the explanation.
The distance between "train" and "travel" is 3 (the orange coloured last cell). What are the three operations you need?
As another example, look at the 5th row and 4th column (again
given in