Lucene Span Queries - An Example

Credits: This note was inspired by the Stack Overflow question and answer "SpanNot Lucene Query being either too strict or too permissive". That question is actually about Elasticsearch, but I wanted to show the equivalent Java/Lucene approach.

Summary

The Lucene JavaDocs provide a summary of the different span query operators, reproduced below. (The JavaDocs also give some examples, which are not shown here.)

Operator	Overview
`SpanTermQuery`	Matches all spans containing a particular `Term`.
`SpanNearQuery`	Matches spans which occur near one another, and can be used to implement things like phrase search (when constructed from `SpanTermQuery`s) and inter-phrase proximity (when constructed from other `SpanNearQuery`s).
`SpanWithinQuery`	Matches spans which occur inside of another span.
`SpanContainingQuery`	Matches spans which contain another span.
`SpanOrQuery`	Merges spans from a number of other `SpanQuery`s.
`SpanNotQuery`	Removes spans matching one `SpanQuery` which overlap (or come near) another. This can be used, e.g., to implement within-paragraph search.
`SpanFirstQuery`	Matches spans matching `q` whose end position is less than `n`. This can be used to constrain matches to the first part of the document.
`SpanPositionRangeQuery`	A more general form of `SpanFirstQuery` that can constrain matches to arbitrary portions of the document.

Span Example

Match documents containing foo, ignoring instances of foo which are followed by bar or bat.

(The original real-world question was about finding instances of phrases such as “United Nations”, “United Airlines”, and so on - but specifically not counting the phrases “United States” and “United Kingdom”.)

The core of this is a SpanNot query:

SpanNotQuery(SpanQuery include, SpanQuery exclude)

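This matches spans from include which have no overlap with spans from exclude. Additional constructors, SpanNotQuery(include, exclude, dist) and SpanNotQuery(include, exclude, pre, post), let you fine-tune the overlap by also rejecting include spans that come within a given number of positions of an exclude span. Here, the plain overlap test between our two span queries is enough.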
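As a rough mental model of that overlap test, here is a small plain-Java sketch with no Lucene dependency. It is not Lucene's implementation: the class SpanNotSketch, its Span type, and its methods are hypothetical names invented for illustration. Spans are treated as token-position ranges with an exclusive end, and the way pre and post widen the exclusion window is an assumption about how the fine-tuning constructors behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of SpanNotQuery filtering (hypothetical, not Lucene code).
final class SpanNotSketch {

    // A span covers token positions start (inclusive) to end (exclusive).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // True if the exclude span falls inside the window around the include span.
    // With pre = post = 0 this is a plain overlap test on half-open ranges.
    static boolean blocks(Span include, Span exclude, int pre, int post) {
        return exclude.end > include.start - pre
                && exclude.start < include.end + post;
    }

    // Keeps every include span that is not blocked by any exclude span.
    static List<Span> spanNot(List<Span> includes, List<Span> excludes, int pre, int post) {
        List<Span> kept = new ArrayList<>();
        for (Span inc : includes) {
            boolean blocked = false;
            for (Span exc : excludes) {
                if (blocks(inc, exc, pre, post)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(inc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Text "foo baz foo bar": foo occupies [0,1) and [2,3); "foo bar" occupies [2,4).
        List<Span> foo = Arrays.asList(new Span(0, 1), new Span(2, 3));
        List<Span> fooBar = Arrays.asList(new Span(2, 4));

        // Plain overlap: only the foo at position 0 survives.
        System.out.println(spanNot(foo, fooBar, 0, 0).size()); // prints 1

        // Widening the window by 2 positions after each foo rejects both.
        System.out.println(spanNot(foo, fooBar, 0, 2).size()); // prints 0
    }
}
```

A document counts as a match as soon as at least one include span survives the filter.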

(There is more than one way to write such a Lucene query - but for the sake of this exercise, I will only use span operators.)

The overall structure of the query we need is:

spanNot(
   foo,
   spanOr(
     spanNear([foo, bar], 0, true),
     spanNear([foo, bat], 0, true)
   )
)

So, this finds any foo which does not overlap with foo bar or with foo bat.

Test Cases

The test documents:

"foo"                  // doc 0 - match
"bar"
"foo bar"
"foo baz"              // doc 3 - match
"foo bat"
"foo bar foo bat"
"foo baz foo bar"      // doc 6 - match
"foo bar foo bat"
"foo bat foo bar"
"foo bat foo foo bar"  // doc 9 - match

Note that a document containing foo bar or foo bat can still be a match if it also contains another foo.

Pedantically, this is slightly different from the real-world example, which implies that foo needs to be followed by something other than bar or bat. But that is not explicitly stated, so I am ignoring that distinction for this example.
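To sanity-check the expected matches without an index, the rule can be applied by hand to the ten test documents. The following plain-Java sketch does that. It is only an illustration: the class SpanNotTestCases and its methods are hypothetical, and whitespace splitting stands in for a real analyzer. It relies on the observation that a single-token foo span can only overlap a "foo bar" or "foo bat" phrase that starts at the same position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hand-rolled check of the test cases (hypothetical helper, no Lucene involved).
final class SpanNotTestCases {

    static final List<String> DOCS = Arrays.asList(
            "foo",
            "bar",
            "foo bar",
            "foo baz",
            "foo bat",
            "foo bar foo bat",
            "foo baz foo bar",
            "foo bar foo bat",
            "foo bat foo bar",
            "foo bat foo foo bar");

    // A document matches if at least one foo is not immediately
    // followed by bar or bat (a trailing foo counts as a match).
    static boolean matches(String doc) {
        String[] tokens = doc.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals("foo")) {
                continue;
            }
            boolean hasNext = i + 1 < tokens.length;
            boolean excluded = hasNext
                    && (tokens[i + 1].equals("bar") || tokens[i + 1].equals("bat"));
            if (!excluded) {
                return true;
            }
        }
        return false;
    }

    // Returns the ids (list positions) of all matching documents.
    static List<Integer> matchingDocs() {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < DOCS.size(); id++) {
            if (matches(DOCS.get(id))) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingDocs()); // prints [0, 3, 6, 9]
    }
}
```

Doc 9 is the interesting case: its middle foo is followed by another foo, so it survives even though the other two are excluded.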

The Span Query Code

This builds up the query step-by-step, for clarity:

Java


private static final String FIELD_NAME = "body"; // must match the field name used when indexing

SpanTermQuery fooTerm = new SpanTermQuery(new Term(FIELD_NAME, "foo"));
SpanTermQuery barTerm = new SpanTermQuery(new Term(FIELD_NAME, "bar"));
SpanTermQuery batTerm = new SpanTermQuery(new Term(FIELD_NAME, "bat"));

SpanQuery[] fooBar = {fooTerm, barTerm};
SpanQuery[] fooBat = {fooTerm, batTerm};

SpanNearQuery nearFooBar = new SpanNearQuery(fooBar, 0, true);
SpanNearQuery nearFooBat = new SpanNearQuery(fooBat, 0, true);

SpanOrQuery spanOr = new SpanOrQuery(nearFooBar, nearFooBat);

Query query = new SpanNotQuery(fooTerm, spanOr);

Two SpanNearQuerys build the phrases foo bar and foo bat. A slop of 0 with inOrder set to true means the two terms must be adjacent and in the given order.

These are combined by a SpanOrQuery, which then becomes the exclude clause of our SpanNotQuery.

This matches documents 0, 3, 6, and 9 - as expected.

Index Builder - code for reference

Java


import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MyIndexBuilder {

    private static final String INDEX_PATH = "./index";
    private static final String FIELD_NAME = "body";

    // Builds a fresh index containing the ten test documents.
    public static void buildIndex() throws IOException {
        final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));

        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        iwc.setCodec(new SimpleTextCodec());
        Document doc;

        // match "foo" unless "foo" is followed by "bar" or "bat"
        List<String> documentBodies = Arrays.asList(
                "foo", // 0 - match
                "bar",
                "foo bar",
                "foo baz", // 3 - match
                "foo bat",
                "foo bar foo bat",
                "foo baz foo bar", // 6 - match
                "foo bar foo bat",
                "foo bat foo bar",
                "foo bat foo foo bar"); // 9 - match

        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
            for (String documentBody : documentBodies) {
                doc = new Document();
                doc.add(new TextField(FIELD_NAME, documentBody, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

}

Index Searcher - code for reference

Java


import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.spans.SpanNearQuery;
import org.apache.lucene.queries.spans.SpanNotQuery;
import org.apache.lucene.queries.spans.SpanOrQuery;
import org.apache.lucene.queries.spans.SpanQuery;
import org.apache.lucene.queries.spans.SpanTermQuery;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MyIndexSearcher {

    private static final String INDEX_PATH = "./index";
    private static final String FIELD_NAME = "body";

    public static void doSearch() throws QueryNodeException, IOException, ParseException {
        //
        // match "foo" unless "foo" is followed by "bar" or "bat"
        //
        SpanTermQuery fooTerm = new SpanTermQuery(new Term(FIELD_NAME, "foo"));
        SpanTermQuery barTerm = new SpanTermQuery(new Term(FIELD_NAME, "bar"));
        SpanTermQuery batTerm = new SpanTermQuery(new Term(FIELD_NAME, "bat"));

        SpanQuery[] fooBar = {fooTerm, barTerm};
        SpanQuery[] fooBat = {fooTerm, batTerm};

        SpanNearQuery nearFooBar = new SpanNearQuery(fooBar, 0, true);
        SpanNearQuery nearFooBat = new SpanNearQuery(fooBat, 0, true);

        SpanOrQuery spanOr = new SpanOrQuery(nearFooBar, nearFooBat);

        Query query = new SpanNotQuery(fooTerm, spanOr);

        // should print docs 0, 3, 6, 9:
        printHits(query);
    }

    private static void printHits(Query query) throws IOException, ParseException {
        System.out.println("------------------------------");
        System.out.println("Parsed query: " + query.toString() + "\n");

        // try-with-resources closes the directory and the reader when we are done
        try (FSDirectory dir = FSDirectory.open(Paths.get(INDEX_PATH));
                IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs results = searcher.search(query, 100);
            ScoreDoc[] hits = results.scoreDocs;
            if (hits.length == 0) {
                System.out.println("  No hits found.");
                System.out.println();
            }

            for (ScoreDoc hit : hits) {
                System.out.println("  doc id = " + hit.doc);
                System.out.println("  score  = " + hit.score);
                Document doc = searcher.storedFields().document(hit.doc);
                System.out.println("  field  = " + doc.get(FIELD_NAME));
                System.out.println();
            }
        }
    }

    private static void printHits(String queryString, Analyzer analyzer)
            throws IOException, ParseException {
        QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
        Query query = parser.parse(queryString);
        printHits(query);
    }

}

Interval Functions

Something similar can be achieved using the Lucene StandardQueryParser, together with interval functions.

Using interval functions, we can create a query string with the following structure:

fn:notWithin(
  foo
  0
  fn:or(
    "foo bar"
    "foo bat"
  )
)

The fn:notWithin function “matches intervals of the source that do not appear within the provided number of positions from the intervals of the reference.”

Arguments:

fn:notWithin(source positions reference)

source
    source sub-interval (term or other function)
positions
    an integer number of maximum positions between source and reference
reference
    reference sub-interval (term or other function)

This is similar to our previous SpanNotQuery.

The code:

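// "analyzer" is assumed to be the same StandardAnalyzer used to build the index
String queryString = "fn:notWithin(foo 0 fn:or(\"foo bar\" \"foo bat\"))";
StandardQueryParser parser = new StandardQueryParser(analyzer);
// the second argument is the default field; parse(...) throws QueryNodeException
Query query = parser.parse(queryString, FIELD_NAME);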

This query matches the same four documents from our original test data.

One difference from the earlier SpanNotQuery approach: the scores are different (I don’t know why).