It is possible to write proximity queries using Lucene’s classic query parser such as this:
This is described as meaning:
search for “apache” and “jakarta” within 10 words of each other in a document
10 is unpleasantly referred to as the “phrase slop” or just “the slop”. But what exactly does within 10 words of each other mean?
The PhraseQuery class provides more information:
The slop is an edit distance between respective positions of terms as defined in this PhraseQuery and the positions of terms in a document.
Um, example, please?
For instance, when searching for “quick fox”, it is expected that the difference between the positions of fox and quick is 1. So “a quick brown fox” would be at an edit distance of 1 since the difference of the positions of fox and quick is 2. Similarly, “the fox is quick” would be at an edit distance of 3 since the difference of the positions of fox and quick is -2. The slop defines the maximum edit distance for a document to match.
I read that several times and was thoroughly confused.
Partly, this confusion was caused by my understanding of the term edit distance. Specifically, I assumed this meant the Levenshtein distance, in which you make a sequence of insertions, deletions and substitutions to transform one word into another. The number of such transformations needed is the edit distance.
But in Lucene, it’s a different definition - more like:
How many word moves do you need, to transform one phrase into another?
A “move” here is when you move a word from its current position to an adjacent position. And a critical part of that is to realize that one position can therefore contain more than one word during (and at the end of) this process.
Revisiting the above example, let’s assume the following two documents, and the search phrase
a quick brown fox
the fox is quick
To transform the document phrase
quick brown fox into the search phrase
quick fox requires one move: Move
fox from position 4 to position 3. Now, with a “slop” of 1, you can re-arrange the document so that it contains the search phrase:
Looking at both documents:
"quick fox"~0 finds no hits (this is equivalent to
"quick fox" - a multi-word phrase)
"quick fox"~1 only finds doc1 (as described above).
"quick fox"~2 also only finds doc1.
"quick fox"~3 finds doc1 and doc2.
Why did it take a slop of
3 for the second document to be found? Because word order is significant. It takes more moves to re-arrange the terms in doc 2 to match the search term - including the search term’s word order.
What if the search term has more than two words in it? It’s the same process.
Assume an “animals.txt” document containing the following:
ant bat cat dog elephant fish giraffe
And assume the search phrase
"fish dog bat".
What is the minumum slop needed to get a hit from our animals document?
It takes six steps to re-arrange our animals text so that it contains our search text:
- 3 steps to move “fish” to in front of “dog”
- 3 more steps to move “bat” to after “dog”
"fish dog bat"~6