More on Lucene Classic Queries

27 Aug 2021

Confusion

What is the difference between the - (exclusion) operator and the NOT operator?

Why does the query -anything always return no results, regardless of what “anything” is and regardless of what data has been indexed?

When does a query of A B mean “A or B” and when does it mean “A and B”?

…and various other questions.

Set-Up

For this article, I use:

  • Lucene 8.9.0
  • the StandardAnalyzer
  • a TextField named “body”
  • the classic query parser
  • the following 6 test documents (Lucene reports them as doc IDs 0 to 5, in the order listed):
  1. apples
  2. oranges
  3. apples oranges
  4. bananas
  5. apples bananas
  6. oranges bananas

I tested a set of queries against the above documents, with the results summarized below:

| Query string | Parsed query | Hits | Note |
|---|---|---|---|
| apples | body:apples | doc = 0; score = 0.3648143; field = apples | 1 |
| | | doc = 2; score = 0.2772589; field = apples oranges | |
| | | doc = 4; score = 0.2772589; field = apples bananas | |
| apples oranges | body:apples body:oranges | doc = 2; score = 0.5545178; field = apples oranges | 2 |
| | | doc = 0; score = 0.3648143; field = apples | |
| | | doc = 1; score = 0.3648143; field = oranges | |
| | | doc = 4; score = 0.2772589; field = apples bananas | |
| | | doc = 5; score = 0.2772589; field = oranges bananas | |
| +apples oranges | +body:apples body:oranges | doc = 2; score = 0.5545178; field = apples oranges | 3 |
| | | doc = 0; score = 0.3648143; field = apples | |
| | | doc = 4; score = 0.2772589; field = apples bananas | |
| 6 queries: apples -oranges, apples NOT oranges, apples OR -oranges, apples OR NOT oranges, apples AND -oranges, apples AND NOT oranges | body:apples -body:oranges | doc = 0; score = 0.3648143; field = apples | 4 |
| | | doc = 4; score = 0.2772589; field = apples bananas | |
| 5 queries: -apples, NOT apples, -anything, NOT anything, -"apples oranges" | -body:apples, -body:anything, -body:"apples oranges" | No hits found. | 5 |

Notes:

  1. Any docs containing “apples”. The doc containing only “apples” scores higher than others.

  2. Any docs containing either “apples” or “oranges”. The doc containing both scores higher than others.

  3. Must contain “apples”. May also contain “oranges” and, if so, these docs score higher than docs which do not contain “oranges”.

  4. All 6 of these queries give the same result: documents containing “apples” but not “oranges”.

  5. No results found due to how - and NOT operate when the query contains only a single term. More on this below.
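
The scores in the table can be reproduced by hand, which helps explain notes 1 and 2. The sketch below assumes the searcher used Lucene 8's default BM25 similarity with k1 = 1.2 and b = 0.75 (the article does not say so explicitly, so treat that as an assumption). The class and method names are made up for illustration.

```java
// Sketch of the BM25 formula, assuming Lucene 8 defaults (k1 = 1.2, b = 0.75, boost = 1).
// Bm25Sketch and its methods are hypothetical names, not part of Lucene.
class Bm25Sketch {

    static final double K1 = 1.2;
    static final double B = 0.75;

    // idf = ln(1 + (N - n + 0.5) / (n + 0.5))
    // N = number of documents, n = number of documents containing the term.
    static double idf(int docCount, int docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one term in one document.
    // Shorter documents get a smaller length penalty, hence a higher score.
    static double score(int docCount, int docFreq, double freq, double docLen, double avgDocLen) {
        double lengthPenalty = K1 * (1 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * freq / (freq + lengthPenalty);
    }

    public static void main(String[] args) {
        int docCount = 6;            // six test documents
        int docFreq = 3;             // "apples" appears in three of them (so does "oranges")
        double avgDocLen = 9.0 / 6;  // nine terms in total across six documents

        // One-term document "apples": about 0.36481
        System.out.println(score(docCount, docFreq, 1, 1, avgDocLen));
        // Two-term document such as "apples oranges": about 0.27726 per matching term
        System.out.println(score(docCount, docFreq, 1, 2, avgDocLen));
    }
}
```

Under these assumptions, the one-term document scores higher simply because it is shorter, and the document matching both “apples” and “oranges” gets the sum of two per-term scores (2 × 0.2772589 ≈ 0.5545178).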

The Exclusion Operator

For the tests relating to note 4 above, we can see that all of the NOT operators were converted to - operators when the query string was parsed. Also, in these specific tests, the inclusion or absence of AND made no difference.

This gives us a clue that - (and therefore also NOT) is not a traditional boolean operator. If it were, then A and not B would be different from A or not B.

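The exclusion - operator (and NOT) marks a clause as prohibited: any document matching that clause is dropped from the results, whatever else it matches. You can think of it as forcing those documents' scores to zero. A prohibited clause only ever removes documents; it never selects any.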

Now when we look at the tests for note 5, we see that these queries always generate no hits, regardless of the data in the source documents.

This may be counterintuitive - especially -anything.

In the first case, the single clause -body:apples excludes all documents containing apples. But there are no other clauses in the query, and a prohibited clause cannot select documents by itself. With no required or optional clause to match against, none of the remaining documents is ever selected or scored. Therefore, no documents can be returned.

In the second case, -body:anything, the overall logic is the same. After excluding all the documents containing anything (even if that means excluding no documents at all), there is still no clause in the query which can select or score the remaining documents.
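
To make this concrete, here is a toy model of the matching logic in plain Java. It is not Lucene code, and the class and method names are hypothetical. It treats positive terms as optional clauses and shows that a query made only of prohibited terms selects nothing, unless something else selects the documents first. In Lucene itself, the usual way to get “everything except X” is to pair the prohibited clause with a match-all clause, for example `*:* -apples` in the classic syntax; the `matchAll` flag below stands in for that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of prohibited-clause matching. Not Lucene code; all names are hypothetical.
class ProhibitedClauseSketch {

    static final List<String> DOCS = Arrays.asList(
            "apples", "oranges", "apples oranges",
            "bananas", "apples bananas", "oranges bananas");

    static boolean contains(String doc, String term) {
        return Arrays.asList(doc.split(" ")).contains(term);
    }

    // Returns the ids of documents selected by at least one positive (optional) term,
    // or by matchAll, and not excluded by any prohibited term.
    static List<Integer> search(List<String> positive, List<String> prohibited, boolean matchAll) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < DOCS.size(); id++) {
            String doc = DOCS.get(id);
            boolean selected = matchAll;
            for (String term : positive) {
                selected |= contains(doc, term);
            }
            boolean excluded = false;
            for (String term : prohibited) {
                excluded |= contains(doc, term);
            }
            if (selected && !excluded) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        // apples -oranges: prints [0, 4]
        System.out.println(search(Arrays.asList("apples"), Arrays.asList("oranges"), false));
        // -apples: nothing selects any document, prints []
        System.out.println(search(none, Arrays.asList("apples"), false));
        // -anything: excludes nothing, but still selects nothing, prints []
        System.out.println(search(none, Arrays.asList("anything"), false));
        // match-all plus -apples: prints [1, 3, 5]
        System.out.println(search(none, Arrays.asList("apples"), true));
    }
}
```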

Old Insights

There’s nothing new in the above discussion. It has all been covered before, but may still be confusing to newcomers (it was certainly confusing to me).

There is an old Lucene discussion thread from 2007, Getting a Better Understanding of Lucene’s Search Operators, which contains an extremely helpful section. I have reproduced that section below for posterity.

Credit to Chris Hostetter, who wrote the following:

begin copied section


In a nutshell…

  1. Lucene’s QueryParser class does not parse boolean expressions – it might look like it, but it does not.

  2. Lucene’s BooleanQuery clause does not model Boolean Queries … it models aggregate queries.

  3. the most native way to represent the options available in a lucene “BooleanQuery” as a string is with the +/- prefixes, where…

     • +foo … means foo is a required clause and docs must match it

     • -foo … means foo is prohibited clause and docs must not match it

     • foo … means foo is an optional clause and docs that match it will get score benefits for doing so.

  4. in an attempt to make things easier for people who have simple needs, QueryParser “fakes” that it parses boolean expressions by interpreting A AND B as +A +B; A OR B as A B and NOT A as -A

  5. if you change the default operator on QueryParser to be AND then things get more complicated, mainly because then QueryParser treats A B the same as +A +B

  6. you should avoid thinking in terms of AND, OR, and NOT … think in terms of OPTIONAL, REQUIRED, and PROHIBITED … your life will be much easier: documentation will make more sense, conversations on the email list will be more synergistastic, wine will be sweeter, and food will taste better.


end copied section

ANDs, ORs and Parentheses

Parentheses ( and ) work the way you would expect, allowing you to construct complex queries.

Some basic examples:

| Query string | Parsed query | Doc hits |
|---|---|---|
| apples AND oranges | +body:apples +body:oranges | 2 |
| apples OR oranges | body:apples body:oranges | 2, 0, 1, 4, 5 |
| apples AND oranges OR bananas | +body:apples +body:oranges body:bananas | 2 |
| apples OR oranges AND bananas | body:apples +body:oranges +body:bananas | 5 |
| apples AND (oranges OR bananas) | +body:apples +(body:oranges body:bananas) | 2, 4 |
| (apples AND oranges) OR bananas | (+body:apples +body:oranges) body:bananas | 2, 3, 4, 5 |
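
The two parenthesized rows are easier to follow if you read the parsed query as nested groups of required and optional clauses. The following is a rough model in plain Java, not Lucene code, with hypothetical names throughout. It assumes a group matches when every required clause matches, no prohibited clause matches, and, where the group has no required clause, at least one optional clause matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model of nested required/optional clauses. Not Lucene code; names are hypothetical.
class NestedClauseSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        final Occur occur;
        final Predicate<String> query;

        Clause(Occur occur, Predicate<String> query) {
            this.occur = occur;
            this.query = query;
        }
    }

    static Clause must(Predicate<String> q) { return new Clause(Occur.MUST, q); }

    static Clause should(Predicate<String> q) { return new Clause(Occur.SHOULD, q); }

    // A term matches a document that contains it as a whole word.
    static Predicate<String> term(String t) {
        return doc -> Arrays.asList(doc.split(" ")).contains(t);
    }

    // A group of clauses, which can itself be used as a clause in another group.
    static Predicate<String> group(Clause... clauses) {
        return doc -> {
            boolean hasMust = false;
            boolean anyShould = false;
            for (Clause c : clauses) {
                boolean matches = c.query.test(doc);
                if (c.occur == Occur.MUST) {
                    hasMust = true;
                    if (!matches) return false;
                }
                if (c.occur == Occur.MUST_NOT && matches) return false;
                if (c.occur == Occur.SHOULD && matches) anyShould = true;
            }
            return hasMust || anyShould;
        };
    }

    // Returns matching doc ids in index order (no scoring in this model).
    static List<Integer> hits(Predicate<String> query) {
        List<String> docs = Arrays.asList("apples", "oranges", "apples oranges",
                "bananas", "apples bananas", "oranges bananas");
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (query.test(docs.get(id))) ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // +body:apples +(body:oranges body:bananas) prints [2, 4]
        System.out.println(hits(group(
                must(term("apples")),
                must(group(should(term("oranges")), should(term("bananas")))))));
        // (+body:apples +body:oranges) body:bananas prints [2, 3, 4, 5]
        System.out.println(hits(group(
                should(group(must(term("apples")), must(term("oranges")))),
                should(term("bananas")))));
    }
}
```

Because the model does no scoring, it lists hits in index order rather than in the score order Lucene uses.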

Some Additional Notes

As we have seen, by default, the classic query parser uses an implied “or” when you write a query such as A B. You can change this from “or” to “and” using the setDefaultOperator() method.
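
With the real parser, the switch is a one-liner along the lines of `parser.setDefaultOperator(QueryParser.Operator.AND)`. To show what that changes, here is a simplified model in plain Java of how a flat chain of terms joined by AND, OR, or nothing ends up as required (+) and optional clauses. It reflects my understanding of the behavior rather than the parser's actual code, it ignores NOT, +, - and parentheses, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of flat AND / OR chains and the default operator.
// Not the parser's real code; names are hypothetical.
class DefaultOperatorSketch {

    enum Op { AND, OR }

    static final String FIELD = "body";

    static String rewrite(String query, Op defaultOp) {
        List<String> terms = new ArrayList<>();
        List<Boolean> required = new ArrayList<>();
        Op conj = null; // operator written before the current term; null if none was written

        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND")) { conj = Op.AND; continue; }
            if (token.equals("OR")) { conj = Op.OR; continue; }

            if (!terms.isEmpty()) {
                int prev = terms.size() - 1;
                // An explicit AND also makes the preceding term required.
                if (conj == Op.AND) required.set(prev, true);
                // When the default is AND, an explicit OR makes the preceding term optional.
                if (conj == Op.OR && defaultOp == Op.AND) required.set(prev, false);
            }

            // With no operator written, the default operator decides.
            Op effective = (conj != null) ? conj : defaultOp;
            terms.add(token);
            required.add(effective == Op.AND);
            conj = null;
        }

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) sb.append(' ');
            if (required.get(i)) sb.append('+');
            sb.append(FIELD).append(':').append(terms.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Default OR: prints body:apples body:oranges
        System.out.println(rewrite("apples oranges", Op.OR));
        // Default AND: prints +body:apples +body:oranges
        System.out.println(rewrite("apples oranges", Op.AND));
        // Mixed chain, default OR: prints +body:apples +body:oranges body:bananas
        System.out.println(rewrite("apples AND oranges OR bananas", Op.OR));
    }
}
```

Note that there is no operator precedence in this model: each AND simply marks its two neighbours as required, which is why mixing AND and OR without parentheses can give surprising results.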

For reference, see the classic query parser's syntax overview documentation and the Javadoc for the QueryParser class.

Don’t forget to escape characters with special meanings; see the QueryParser.escape() method.

The special characters are:

+  -  &&  ||  !  (  )  {  }  [  ]  ^  "  ~  *  ?  :  \  /
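
In real code you would call the parser's own escape() method. Purely to illustrate the idea, here is a stand-in written in plain Java that puts a backslash in front of each special character. The helper and its name are hypothetical, and the character set is taken from the list above.

```java
// Illustrative stand-in for escaping query syntax characters.
// Hypothetical helper, not Lucene code; prefer the parser's own escape() method.
class EscapeSketch {

    // Single characters from the list above ("&&" and "||" are covered by '&' and '|').
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: apples\-oranges \(1\:1\)
        System.out.println(escape("apples-oranges (1:1)"));
    }
}
```

Escaping matters mostly when the query text comes from user input, where a stray - or : would otherwise be read as syntax.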

Lucene's BooleanQuery

Lucene’s BooleanQuery class is similar to the above syntax in several ways - it’s not really a boolean processor. I have already written about that elsewhere.

The Code I Used

I index the documents as follows:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MyIndexBuilder {

    private static final String INDEX_PATH = "./index";
    private static final String FIELD_NAME = "body";

    public static void buildIndex() throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        // CREATE replaces any existing index, so each run starts from an empty index.
        iwc.setOpenMode(OpenMode.CREATE);

        List<String> documentBodies = Arrays.asList(
                "apples",
                "oranges",
                "apples oranges",
                "bananas",
                "apples bananas",
                "oranges bananas");

        // try-with-resources closes the writer (committing the documents) and the directory.
        try (Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
                IndexWriter writer = new IndexWriter(dir, iwc)) {

            for (String documentBody : documentBodies) {
                Document doc = new Document();
                // TextField is analyzed; Store.YES keeps the original text for display.
                doc.add(new TextField(FIELD_NAME, documentBody, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

}

I run queries as follows:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MyIndexSearcher {

    private static final String INDEX_PATH = "./index";
    private static final String FIELD_NAME = "body";
    private static final StandardAnalyzer ANALYZER = new StandardAnalyzer();
    // The classic query parser, using "body" as the default field.
    private static final QueryParser PARSER = new QueryParser(FIELD_NAME, ANALYZER);

    public static void doSearch() throws IOException, ParseException {

        List<String> queryStrings = Arrays.asList(
                "apples",
                "apples oranges",
                "+apples oranges",
                "apples -oranges",
                "apples NOT oranges",
                "apples OR -oranges",
                "apples OR NOT oranges",
                "-apples",
                "NOT apples",
                "-anything",
                "NOT anything",
                "apples AND NOT oranges",
                "apples AND -oranges",
                "apples -\"apples oranges\"",
                "-\"apples oranges\"");

        for (String queryString : queryStrings) {
            printHits(queryString);
        }
    }

    private static void printHits(String queryString) throws IOException, ParseException {
        Query query = PARSER.parse(queryString);
        System.out.println("------------------------------");
        System.out.println("Query string: " + queryString);
        System.out.println("Parsed query: " + query.toString() + "\n");

        // Close the directory and reader after each query to avoid leaking file handles.
        try (Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
                IndexReader reader = DirectoryReader.open(dir)) {

            IndexSearcher searcher = new IndexSearcher(reader);
            // Ask for up to 100 hits, returned in descending score order.
            TopDocs results = searcher.search(query, 100);
            ScoreDoc[] hits = results.scoreDocs;
            if (hits.length == 0) {
                System.out.println("No hits found.");
                System.out.println();
            }
            for (ScoreDoc hit : hits) {
                System.out.println("  doc = " + hit.doc + "; score = " + hit.score);
                Document doc = searcher.doc(hit.doc);
                System.out.println("  field = " + doc.get(FIELD_NAME));
                System.out.println();
            }
        }
    }

}