Faster edit distance using anchor words

2024-05-30

Many common string comparison algorithms run in O(N²) time, where N is the (roughly equal) length of the documents being compared. This makes comparing long strings, such as plaintext or HTML versions of books, slow as molasses. For example, calculating the edit distance between two versions of Pride and Prejudice (125,000 words) takes about 17 seconds on a laptop of recent vintage:

$ hyperfine "python3 edit_distance.py pg1342.txt pg42671.txt"
Benchmark 1: python3 edit_distance.py pg1342.txt pg42671.txt
  Time (mean ± σ):     16.808 s ±  0.068 s    [User: 16.774 s, System: 0.026 s]
  Range (min … max):   16.736 s … 16.945 s    10 runs
Output: 74321
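
The edit_distance.py script itself isn't shown here, so as a point of reference, here's a minimal sketch of what a quadratic baseline typically looks like: the classic Levenshtein dynamic program. Treating the texts as sequences of whitespace-separated words is my assumption; the same routine works on character sequences.

import sys

def edit_distance(a, b):
    # Classic Levenshtein dynamic program; O(len(a) * len(b)) time.
    # prev[j] holds the distance between the first i-1 tokens of a and the
    # first j tokens of b; only two rows are kept to save memory.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, y in enumerate(b, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (x != y))    # substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    a, b = (open(path, encoding="utf-8").read().split() for path in sys.argv[1:3])
    print(edit_distance(a, b))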

Things get even slower if we’re dealing with longer texts—many books are longer than Pride and Prejudice—or, worse, if we have an algorithm that has cubic time complexity such as aligning the characters in three versions using dynamic programming.

Fortunately, there’s a divide-and-conquer trick that speeds things up whenever we’re comparing highly similar documents written in natural language. The trick turns our O(N²) problem into many small problems that can be tackled independently. It works by first identifying unique words shared by all texts and then using these anchor words to partition the texts into aligned chunks. Because texts written in English, Spanish, Chinese, etc. tend to contain many words used only once, there tend to be many such anchors; even if one or more of the texts is damaged or corrupted, there will still be plenty of anchor words to use. Once the texts are divided into aligned chunks, we run our expensive algorithm on each chunk set and then aggregate the results.
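
Here is a minimal sketch of the chunking idea, assuming word-level tokens, anchors defined as words that occur exactly once in every text, and an edit_distance routine like the one sketched above (the function names are mine, not taken from the downloadable code):

from collections import Counter

def find_anchors(token_lists):
    # Candidate anchors: words that occur exactly once in every text.
    counts = [Counter(tokens) for tokens in token_lists]
    shared = set.intersection(*(set(c) for c in counts))
    return {w for w in shared if all(c[w] == 1 for c in counts)}

def split_on_anchors(tokens, anchors):
    # Cut the token stream at each anchor, dropping the anchors themselves.
    chunks, current = [], []
    for token in tokens:
        if token in anchors:
            chunks.append(current)
            current = []
        else:
            current.append(token)
    chunks.append(current)
    return chunks

def chunked_edit_distance(a_tokens, b_tokens):
    anchors = find_anchors([a_tokens, b_tokens])
    a_chunks = split_on_anchors(a_tokens, anchors)
    b_chunks = split_on_anchors(b_tokens, anchors)
    # Each anchor occurs exactly once in each text, so both texts split into
    # the same number of chunks; solve each small problem and sum the results.
    return sum(edit_distance(x, y) for x, y in zip(a_chunks, b_chunks))

Forcing the alignment to pass through the anchors reproduces the exact edit distance whenever the optimal alignment also matches every anchor, which is effectively always the case for highly similar texts. The sketch also assumes the anchors appear in the same order in both texts, the order-checking step discussed below.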

If we break the two versions of Pride and Prejudice apart at the ~2,300 anchor words detected (e.g., “overhearings”, “intrepidity”, “silliest”, “overthrowing”), we get several thousand aligned chunks. Operating on each chunk pair independently (serially) and then aggregating the edit distances, we get the expected result 117 times faster (0.143s):

$ hyperfine "cat filenames.txt | xargs python3 edit_distance_chunks.py"
Benchmark 1: cat filenames.txt | xargs python3 edit_distance_chunks.py
  Time (mean ± σ):     143.4 ms ±   3.2 ms    [User: 108.5 ms, System: 35.7 ms]
  Range (min … max):   140.0 ms … 153.1 ms    21 runs
Output: 74321

I haven’t found a canonical reference for the anchor words technique. I suspect it has been circulating for decades, as it is an instance of a generic divide-and-conquer tactic often used with O(N²) problems. One paper that describes the trick with admirable clarity is Feng and Manmatha (2006). They use the technique to speed up estimating the parameters of a linear HMM, another O(N²) algorithm, which they use to compare an OCR text with a reference text from Project Gutenberg. This is essentially the same problem as the one we just considered, because the OCR version and the reference PG text are typically derived from different print editions. The Feng and Manmatha paper carefully walks through the procedure for identifying anchor words. They mention a step one might otherwise miss: making sure the order of anchor words in the documents-to-be-chunked matches. This step matters if there are pages or sections that have been duplicated or transposed.
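One way to implement that order check (my own choice of method; the paper may do it differently) is to keep the largest subset of candidate anchors whose relative order agrees in both documents, i.e., a longest increasing subsequence of their positions in the second text when visited in first-text order:

from bisect import bisect_left

def order_consistent_anchors(a_tokens, b_tokens, anchors):
    # Position of each candidate anchor in text B, and the B-positions
    # visited in the order the anchors appear in text A.
    b_pos = {w: i for i, w in enumerate(b_tokens) if w in anchors}
    seq = [b_pos[w] for w in a_tokens if w in b_pos]

    # Longest increasing subsequence (patience sorting) over the B-positions;
    # the surviving positions correspond to anchors whose order agrees.
    tails, tails_idx, parent = [], [], [None] * len(seq)
    for i, v in enumerate(seq):
        j = bisect_left(tails, v)
        parent[i] = tails_idx[j - 1] if j > 0 else None
        if j == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[j] = v
            tails_idx[j] = i

    keep = set()
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        keep.add(seq[i])
        i = parent[i]
    return {w for w in anchors if b_pos[w] in keep}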

Python code implementing the technique as described by Feng and Manmatha (2006) is available for download.

What’s remarkable about the technique is how reliably it works. Anchor words are so abundant in this setting that we can always find plenty of serviceable ones, no matter how corrupted or damaged the texts being compared are. In other settings involving long character sequences there’s no such abundance: if we’re comparing two globin proteins, for example, we don’t tend to have many long substrings that can serve as anchors. With documents written in natural language, you’re effectively guaranteed to find anchor words everywhere, because written texts tend to feature so many words used only a single time. As Feng and Manmatha point out, Zipf’s law lets us quantify this: roughly half of the distinct words in a text will occur only once.
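
This is easy to check on any of the plain-text files above; the sketch below counts the fraction of distinct words that occur exactly once (hapax legomena):

from collections import Counter

def hapax_fraction(path):
    # Fraction of distinct words that appear exactly once in the file.
    counts = Counter(open(path, encoding="utf-8").read().lower().split())
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Per the Zipf argument above, expect a value in the neighborhood of one half:
print(hapax_fraction("pg1342.txt"))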

References

  • Feng, S., & Manmatha, R. (2006). A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 109–118).