- What is the difference between the `-` (exclusion) operator and the `NOT` operator?
- Why does the query `-anything` always return no results, regardless of what “anything” is and regardless of what data has been indexed?
- When does a query of `A B` mean “A or B” and when does it mean “A and B”?
- …and various other questions.
For this article, I use:

- `StandardAnalyzer`
- A `TextField` named “body”

I tested a set of queries against the test documents, with the results summarized below:
Query string | Parsed query | Hits | Notes |
---|---|---|---|
`apples` | `body:apples` | doc = 0; score = 0.3648143; field = apples<br>doc = 2; score = 0.2772589; field = apples oranges<br>doc = 4; score = 0.2772589; field = apples bananas | 1 |
`apples oranges` | `body:apples body:oranges` | doc = 2; score = 0.5545178; field = apples oranges<br>doc = 0; score = 0.3648143; field = apples<br>doc = 1; score = 0.3648143; field = oranges<br>doc = 4; score = 0.2772589; field = apples bananas<br>doc = 5; score = 0.2772589; field = oranges bananas | 2 |
`+apples oranges` | `+body:apples body:oranges` | doc = 2; score = 0.5545178; field = apples oranges<br>doc = 0; score = 0.3648143; field = apples<br>doc = 4; score = 0.2772589; field = apples bananas | 3 |
6 queries:<br>`apples -oranges`<br>`apples NOT oranges`<br>`apples OR -oranges`<br>`apples OR NOT oranges`<br>`apples AND -oranges`<br>`apples AND NOT oranges` | `body:apples -body:oranges` | doc = 0; score = 0.3648143; field = apples<br>doc = 4; score = 0.2772589; field = apples bananas | 4 |
5 queries:<br>`-apples`<br>`NOT apples`<br>`-anything`<br>`NOT anything`<br>`-"apples oranges"` | `-body:apples`<br>`-body:anything`<br>`-body:"apples oranges"` | No hits found. | 5 |
Notes:
1. Any docs containing “apples”. The doc containing only “apples” scores higher than others.
2. Any docs containing either “apples” or “oranges”. The doc containing both scores higher than others.
3. Must contain “apples”. May also contain “oranges” and, if so, these docs score higher than docs which do not contain “oranges”.
4. All six of these queries give the same result: documents containing “apples” but not “oranges”.
5. No results found due to how `-` and `NOT` operate when the query contains only a single term. More on this below.
For the tests relating to note 4 above, we can see that all of the `NOT` operators were converted to `-` operators when the query string was parsed. Also, in these specific tests, the inclusion or absence of `AND` made no difference.

This gives us a clue that `-` (and therefore also `NOT`) is not a traditional boolean operator. If it were, then `A and not B` would be different from `A or not B`.
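To see why a traditional boolean engine would treat these two expressions differently, here is a minimal plain-Java sketch (no Lucene involved). It evaluates both expressions with strict boolean logic over six short documents. Documents 0, 1, 2, 4 and 5 are taken from the hits in the table above; document 3 (“bananas”) is an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StrictBooleanDemo {
    // Documents 0, 1, 2, 4 and 5 appear in the hits table; document 3 is assumed.
    static final String[] DOCS = {
        "apples", "oranges", "apples oranges", "bananas", "apples bananas", "oranges bananas"
    };

    static boolean has(int doc, String term) {
        return Arrays.asList(DOCS[doc].split(" ")).contains(term);
    }

    // Strict boolean logic: A AND (NOT B)
    static List<Integer> andNot(String a, String b) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < DOCS.length; i++) {
            if (has(i, a) && !has(i, b)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Strict boolean logic: A OR (NOT B)
    static List<Integer> orNot(String a, String b) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < DOCS.length; i++) {
            if (has(i, a) || !has(i, b)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(andNot("apples", "oranges")); // [0, 4]
        System.out.println(orNot("apples", "oranges"));  // [0, 2, 3, 4]
    }
}
```

Under strict boolean logic the two expressions select different documents, whereas the parsed Lucene queries in the table return docs 0 and 4 in both cases.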
The exclusion operator `-` (and `NOT`) forces all documents containing the term to have their scores set to zero, which means those documents will not be returned as matches.
Now when we look at the tests for note 5, we see that these queries always generate no hits, regardless of the data in the source documents.

This may be counterintuitive, especially for `-anything`.

In the first case, the single clause `-body:apples` forces all documents containing `apples` to be given a score of zero. But there are no other clauses in the query, so there is no additional information which can be used to calculate scores for the remaining documents. They therefore all stay in their initial “unscored” state, and no documents can be returned.

In the second case, `-body:anything`, the overall logic is the same. After removing all the documents containing `anything` from scoring consideration (even if that means removing no documents at all), there is still no other information in the query which can be used for scoring purposes.
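The reasoning above can be mimicked with a small plain-Java model. This is only a sketch of the mental model described here, not Lucene's actual implementation: every document starts as “unscored” (`null`), positive terms add to a score, and prohibited terms force the score to zero. The class name, the method and document 3 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ProhibitedOnlyDemo {
    // Documents 0, 1, 2, 4 and 5 appear in the hits table; document 3 is assumed.
    static final String[] DOCS = {
        "apples", "oranges", "apples oranges", "bananas", "apples bananas", "oranges bananas"
    };

    /** Returns the ids of documents that end up with a positive score. */
    static List<Integer> search(List<String> positive, List<String> prohibited) {
        Integer[] scores = new Integer[DOCS.length]; // null means "unscored"
        for (int i = 0; i < DOCS.length; i++) {
            List<String> words = Arrays.asList(DOCS[i].split(" "));
            for (String term : positive) {
                if (words.contains(term)) {
                    scores[i] = (scores[i] == null ? 0 : scores[i]) + 1;
                }
            }
            for (String term : prohibited) {
                if (words.contains(term)) {
                    scores[i] = 0; // excluded, whatever it scored before
                }
            }
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < DOCS.length; i++) {
            if (scores[i] != null && scores[i] > 0) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(search(Arrays.asList("apples"), Arrays.asList("oranges"))); // [0, 4]
        System.out.println(search(none, Arrays.asList("apples")));                     // []
        System.out.println(search(none, Arrays.asList("anything")));                   // []
    }
}
```

With no positive terms, nothing ever moves a document out of the “unscored” state, so the result is empty whether or not the prohibited term occurs anywhere.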
There’s nothing new in the above discussion. It has all been covered before, but may still be confusing to newcomers (it was certainly confusing to me).
There is an old Lucene discussion thread from 2007, Getting a Better Understanding of Lucene’s Search Operators, which contains an extremely helpful section, reproduced below for posterity.
Credit to Chris Hostetter, who wrote the following:
begin copied section
In a nutshell…

Lucene’s `QueryParser` class does not parse boolean expressions – it might look like it, but it does not.

Lucene’s `BooleanQuery` clause does not model Boolean Queries … it models aggregate queries.

the most native way to represent the options available in a lucene “BooleanQuery” as a string is with the +/- prefixes, where…

- `+foo` … means foo is a required clause and docs must match it
- `-foo` … means foo is prohibited clause and docs must not match it
- `foo` … means foo is an optional clause and docs that match it will get score benefits for doing so.

in an attempt to make things easier for people who have simple needs, `QueryParser` “fakes” that it parses boolean expressions by interpreting `A AND B` as `+A +B`; `A OR B` as `A B` and `NOT A` as `-A`

if you change the default operator on `QueryParser` to be `AND` then things get more complicated, mainly because then `QueryParser` treats `A B` the same as `+A +B`

you should avoid thinking in terms of `AND`, `OR`, and `NOT` … think in terms of `OPTIONAL`, `REQUIRED`, and `PROHIBITED` … your life will be much easier: documentation will make more sense, conversations on the email list will be more synergistastic, wine will be sweeter, and food will taste better.
end copied section
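The OPTIONAL / REQUIRED / PROHIBITED view can be sketched as a toy model in plain Java. Everything here is an assumption made for illustration: the class, the tiny `+foo` / `-foo` / `foo` parser, document 3, and above all the score, which simply counts matched clauses and is not Lucene's real scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AggregateQueryModel {
    enum Occur { OPTIONAL, REQUIRED, PROHIBITED }

    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    // Documents 0, 1, 2, 4 and 5 appear in the hits table; document 3 is assumed.
    static final String[] DOCS = {
        "apples", "oranges", "apples oranges", "bananas", "apples bananas", "oranges bananas"
    };

    /** Parses only the prefix syntax: "+foo", "-foo" or "foo", separated by whitespace. */
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith("+")) {
                clauses.add(new Clause(token.substring(1), Occur.REQUIRED));
            } else if (token.startsWith("-")) {
                clauses.add(new Clause(token.substring(1), Occur.PROHIBITED));
            } else {
                clauses.add(new Clause(token, Occur.OPTIONAL));
            }
        }
        return clauses;
    }

    /** Toy score: number of matched non-prohibited clauses, or 0 when the doc is not a hit. */
    static int score(String doc, List<Clause> clauses) {
        List<String> words = Arrays.asList(doc.split(" "));
        int matched = 0;
        for (Clause c : clauses) {
            boolean present = words.contains(c.term);
            if (c.occur == Occur.PROHIBITED && present) return 0;
            if (c.occur == Occur.REQUIRED && !present) return 0;
            if (c.occur != Occur.PROHIBITED && present) matched++;
        }
        return matched;
    }

    /** Returns hit ids, highest toy score first; ties keep document order. */
    static List<Integer> rank(String query) {
        List<Clause> clauses = parse(query);
        int[] scores = new int[DOCS.length];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < DOCS.length; i++) {
            scores[i] = score(DOCS[i], clauses);
            if (scores[i] > 0) hits.add(i);
        }
        hits.sort((a, b) -> Integer.compare(scores[b], scores[a]));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(rank("apples oranges"));  // [2, 0, 1, 4, 5]
        System.out.println(rank("+apples oranges")); // [2, 0, 4]
        System.out.println(rank("apples -oranges")); // [0, 4]
        System.out.println(rank("-apples"));         // []
    }
}
```

Even with this crude score, the hit lists come out in the same order as the Lucene results in the first table, which suggests the three-clause model is the right way to read these queries.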
Parentheses `(` and `)` work the way you would expect, allowing you to construct complex queries.
Some basic examples:
Query string | Parsed query | Doc Hits |
---|---|---|
`apples AND oranges` | `+body:apples +body:oranges` | 2 |
`apples OR oranges` | `body:apples body:oranges` | 2, 0, 1, 4, 5 |
`apples AND oranges OR bananas` | `+body:apples +body:oranges body:bananas` | 2 |
`apples OR oranges AND bananas` | `body:apples +body:oranges +body:bananas` | 5 |
`apples AND (oranges OR bananas)` | `+body:apples +(body:oranges body:bananas)` | 2, 4 |
`(apples AND oranges) OR bananas` | `(+body:apples +body:oranges) body:bananas` | 2, 3, 4, 5 |
As we have seen, by default, the classic query parser uses an implied “or” when you write a query such as `A B`. You can change this from “or” to “and” using the `setDefaultOperator()` method.
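A minimal sketch of switching the default operator, assuming Lucene 8.x or 9.x with `lucene-core` and `lucene-queryparser` on the classpath. The class and helper method names are hypothetical; the field name “body” and `StandardAnalyzer` follow the setup used in this article.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;

class DefaultOperatorDemo {
    /** Parses the query string and returns the parsed query rendered as a string. */
    static String parse(String queryString, boolean defaultAnd) {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        // OR is the default; AND makes every unprefixed clause required.
        parser.setDefaultOperator(defaultAnd ? QueryParser.Operator.AND : QueryParser.Operator.OR);
        try {
            return parser.parse(queryString).toString();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + queryString, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("apples oranges", false)); // body:apples body:oranges
        System.out.println(parse("apples oranges", true));  // +body:apples +body:oranges
    }
}
```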
The syntax overview documentation.

The `QueryParser` object.

Don’t forget to escape characters with special meanings: see the `escape()` method.

The special characters are:

`+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /`
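In practice you would simply call `QueryParser.escape(userInput)`. To show what that escaping amounts to, here is a dependency-free sketch that puts a backslash in front of each character from the list above. The class and method are hypothetical stand-ins, not Lucene's implementation. As a simplification, every `&` and `|` is escaped individually, not only the `&&` and `||` pairs.

```java
class EscapeSketch {
    // The special characters listed above, with & and | treated one character at a time.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    /** Returns the input with a backslash inserted before each special character. */
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2")); // \(1\+1\)\:2
        System.out.println(escape("apples"));  // apples
    }
}
```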
Lucene’s BooleanQuery
class is similar to the above syntax in several way - it’s not really a boolean processor. I have already written about that elsewhere.
I index the documents as follows:
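The listing below is a minimal sketch of such indexing and not necessarily the exact code used for the tests. It assumes Lucene 8.x or 9.x (`lucene-core` on the classpath), an in-memory `ByteBuffersDirectory`, and six documents: documents 0, 1, 2, 4 and 5 are taken from the hits shown in the tables, while document 3 (“bananas”) is an assumption. The class and method names are hypothetical.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

class IndexSketch {
    // Document 3 is assumed; the others appear in the hits tables.
    static final String[] BODIES = {
        "apples", "oranges", "apples oranges", "bananas", "apples bananas", "oranges bananas"
    };

    /** Indexes each string as one document with a single TextField named "body". */
    static Directory buildIndex(String... bodies) throws IOException {
        Directory dir = new ByteBuffersDirectory(); // in-memory, for experiments only
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (String body : bodies) {
                Document doc = new Document();
                doc.add(new TextField("body", body, Field.Store.YES));
                writer.addDocument(doc);
            }
        } // closing the writer commits the documents
        return dir;
    }

    public static void main(String[] args) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(buildIndex(BODIES))) {
            System.out.println("indexed documents: " + reader.numDocs());
        }
    }
}
```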
I run queries as follows:
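Again as a minimal sketch and not the exact test code: the class below parses a query string with the classic `QueryParser` and prints the parsed query and each hit. It assumes Lucene 8.x or 9.x with `lucene-core` and `lucene-queryparser` on the classpath. To stay self-contained it builds its own small in-memory index, in which document 3 (“bananas”) is an assumption. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

class QuerySketch {
    // Document 3 is assumed; the others appear in the hits tables.
    static final String[] BODIES = {
        "apples", "oranges", "apples oranges", "bananas", "apples bananas", "oranges bananas"
    };

    /** Runs the query against a fresh in-memory index and returns hit ids, best score first. */
    static List<Integer> search(String queryString) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String body : BODIES) {
                Document doc = new Document();
                doc.add(new TextField("body", body, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        Query query = new QueryParser("body", analyzer).parse(queryString);
        System.out.println("parsed query: " + query);

        List<Integer> ids = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                // Doc ids follow insertion order in this freshly built, single-threaded index.
                System.out.println("doc = " + hit.doc + "; score = " + hit.score
                        + "; field = " + BODIES[hit.doc]);
                ids.add(hit.doc);
            }
        }
        if (ids.isEmpty()) {
            System.out.println("No hits found.");
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        search("apples -oranges");
        search("-anything");
    }
}
```

The exact scores depend on the Lucene version and its similarity settings, so they may not match the figures in the tables.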