Credits: This note was inspired by the following Stack Overflow question and answer: "SpanNot Lucene Query being either too strict or too permissive". That question is actually about Elasticsearch, but I wanted to show the equivalent Java/Lucene approach.
Summary
The Lucene JavaDocs provide a summary of the different span query operators, reproduced below, together with some examples (not shown here):
Operator | Overview |
---|---|
SpanTermQuery | Matches all spans containing a particular Term. |
SpanNearQuery | Matches spans which occur near one another, and can be used to implement things like phrase search (when constructed from SpanTermQuery instances) and inter-phrase proximity (when constructed from other SpanNearQuery instances). |
SpanWithinQuery | Matches spans which occur inside other spans. |
SpanContainingQuery | Matches spans which contain other spans. |
SpanOrQuery | Merges spans from a number of other SpanQuery instances. |
SpanNotQuery | Removes spans matching one SpanQuery which overlap (or come near) another. This can be used, e.g., to implement within-paragraph search. |
SpanFirstQuery | Matches spans matching q whose end position is less than n. This can be used to constrain matches to the first part of the document. |
SpanPositionRangeQuery | A more general form of SpanFirstQuery that can constrain matches to arbitrary portions of the document. |
Span Example
Match documents containing foo, ignoring instances of foo which are followed by bar or bat.
(The original real-world question was about finding instances of phrases such as “United Nations”, “United Airlines”, and so on - but specifically not counting the phrases “United States” and “United Kingdom”.)
The core of this is a SpanNotQuery:
SpanNotQuery(SpanQuery include, SpanQuery exclude)
This matches spans from include which have no overlap with spans from exclude. There are additional constructors which allow you to fine-tune the overlap. But here, we can use a simple overlap between our two span queries.
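To make "overlap" concrete, here is a minimal plain-Java sketch of the exclusion test. It is a simplified model, not Lucene source code: the class name SpanNotOverlapModel and the method isExcluded are made up for illustration, and spans are assumed to be half-open ranges of token positions. The pre and post parameters model the fine-tuning offered by the additional constructors SpanNotQuery(include, exclude, dist) and SpanNotQuery(include, exclude, pre, post); with both set to 0 the test reduces to a plain overlap check.

```java
// Simplified model of the SpanNotQuery exclusion test (illustration only, not Lucene code).
final class SpanNotOverlapModel {

    /**
     * Spans are half-open token ranges [start, end).
     * Returns true when the exclude span lies within `pre` tokens before,
     * or `post` tokens after, the include span. With pre = post = 0 this
     * is a plain overlap test.
     */
    static boolean isExcluded(int inclStart, int inclEnd,
                              int exclStart, int exclEnd,
                              int pre, int post) {
        return exclEnd > inclStart - pre && exclStart < inclEnd + post;
    }

    public static void main(String[] args) {
        // Text "foo bar": include "foo" = [0, 1), exclude phrase "foo bar" = [0, 2).
        System.out.println(isExcluded(0, 1, 0, 2, 0, 0)); // true

        // Text "foo baz bar": include "foo" = [0, 1), exclude "bar" = [2, 3).
        System.out.println(isExcluded(0, 1, 2, 3, 0, 0)); // false
        System.out.println(isExcluded(0, 1, 2, 3, 0, 2)); // true
    }
}
```

In this model, widening post pulls in exclude spans that start after the include span ends, which is the kind of adjustment the extra constructors are meant for.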
(There is more than one way to write such a Lucene query - but for the sake of this exercise, I will only use span operators.)
The overall structure of the query we need is:
    spanNot(
      foo,
      spanOr(
        spanNear([foo, bar], 0, true),
        spanNear([foo, bat], 0, true)
      )
    )
So, this finds any foo which does not overlap with "foo bar" or with "foo bat".
Test Cases
The test documents:
    "foo"                  // doc 0 - match
    "bar"
    "foo bar"
    "foo baz"              // doc 3 - match
    "foo bat"
    "foo bar foo bat"
    "foo baz foo bar"      // doc 6 - match
    "foo bar foo bat"
    "foo bat foo bar"
    "foo bat foo foo bar"  // doc 9 - match
Note that a document containing "foo bar" or "foo bat" can still be a match if it also contains another foo.
Pedantically, this is slightly different from the specific real-world example, which implies that foo needs to be followed by something other than bar or bat. But that is not explicitly stated, so I am ignoring that distinction in this example.
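The expected matches can be checked by hand with a small plain-Java model of the query. This is only a sketch of the overlap rule and does not use Lucene: the class SpanNotModel and its methods are hypothetical names, tokens are obtained by splitting on single spaces (no analyzer), and spans are treated as half-open ranges of token positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of spanNot(foo, spanOr("foo bar", "foo bat")).
// It mimics the overlap rule only; it does not use or reproduce Lucene code.
final class SpanNotModel {

    /** A half-open range of token positions: [start, end). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }
    }

    /** Spans of the adjacent, in-order phrase "first second". */
    static List<Span> phraseSpans(String[] tokens, String first, String second) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(first) && tokens[i + 1].equals(second)) {
                spans.add(new Span(i, i + 2));
            }
        }
        return spans;
    }

    /** True if at least one "foo" span overlaps no "foo bar" or "foo bat" span. */
    static boolean matches(String text) {
        String[] tokens = text.split(" ");
        List<Span> exclude = new ArrayList<>(phraseSpans(tokens, "foo", "bar"));
        exclude.addAll(phraseSpans(tokens, "foo", "bat"));
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals("foo")) {
                continue;
            }
            Span foo = new Span(i, i + 1);
            if (exclude.stream().noneMatch(foo::overlaps)) {
                return true;
            }
        }
        return false;
    }

    /** Indexes of the matching documents, in input order. */
    static List<Integer> matchingDocs(List<String> docs) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (matches(docs.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "foo", "bar", "foo bar", "foo baz", "foo bat",
                "foo bar foo bat", "foo baz foo bar", "foo bar foo bat",
                "foo bat foo bar", "foo bat foo foo bar");
        System.out.println(matchingDocs(docs)); // prints [0, 3, 6, 9]
    }
}
```

Document 9 matches because its middle foo (position 2) is followed by another foo, so it overlaps neither excluded phrase.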
The Span Query Code
This builds up the query step-by-step, for clarity:
    // Must match the field name used when indexing.
    private static final String FIELD_NAME = "body";

    // One span query per term.
    SpanTermQuery fooTerm = new SpanTermQuery(new Term(FIELD_NAME, "foo"));
    SpanTermQuery barTerm = new SpanTermQuery(new Term(FIELD_NAME, "bar"));
    SpanTermQuery batTerm = new SpanTermQuery(new Term(FIELD_NAME, "bat"));

    // Slop 0 and inOrder = true: the terms must be adjacent and in this order.
    SpanQuery[] fooBar = {fooTerm, barTerm};
    SpanQuery[] fooBat = {fooTerm, batTerm};
    SpanNearQuery nearFooBar = new SpanNearQuery(fooBar, 0, true);
    SpanNearQuery nearFooBat = new SpanNearQuery(fooBat, 0, true);

    // Either phrase is excluded.
    SpanOrQuery spanOr = new SpanOrQuery(nearFooBar, nearFooBat);

    // Keep "foo" spans which do not overlap an excluded phrase.
    Query query = new SpanNotQuery(fooTerm, spanOr);
Two SpanNearQuery instances are used to build the phrases "foo bar" and "foo bat".
These are used by a SpanOrQuery, which then becomes the exclude clause of our SpanNotQuery.
This matches documents 0, 3, 6, and 9 - as expected.
Index Builder - code for reference
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MyIndexBuilder {

        private static final String INDEX_PATH = "./index";
        private static final String FIELD_NAME = "body";

        public static void buildIndex() throws IOException {
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            // Replace any existing index on each run.
            iwc.setOpenMode(OpenMode.CREATE);
            // Write the index as human-readable text files, for inspection.
            iwc.setCodec(new SimpleTextCodec());

            // match "foo" unless "foo" is followed by "bar" or "bat"
            List<String> documentBodies = Arrays.asList(
                    "foo", // 0 - match
                    "bar",
                    "foo bar",
                    "foo baz", // 3 - match
                    "foo bat",
                    "foo bar foo bat",
                    "foo baz foo bar", // 6 - match
                    "foo bar foo bat",
                    "foo bat foo bar",
                    "foo bat foo foo bar"); // 9 - match

            // Close the writer and the directory when done.
            try (Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
                    IndexWriter writer = new IndexWriter(dir, iwc)) {
                for (String documentBody : documentBodies) {
                    Document doc = new Document();
                    doc.add(new TextField(FIELD_NAME, documentBody, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }
    }
Index Searcher - code for reference
    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.spans.SpanNearQuery;
    import org.apache.lucene.queries.spans.SpanNotQuery;
    import org.apache.lucene.queries.spans.SpanOrQuery;
    import org.apache.lucene.queries.spans.SpanQuery;
    import org.apache.lucene.queries.spans.SpanTermQuery;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MyIndexSearcher {

        private static final String INDEX_PATH = "./index";
        private static final String FIELD_NAME = "body";

        public static void doSearch() throws QueryNodeException, IOException, ParseException {
            //
            // match "foo" unless "foo" is followed by "bar" or "bat"
            //
            SpanTermQuery fooTerm = new SpanTermQuery(new Term(FIELD_NAME, "foo"));
            SpanTermQuery barTerm = new SpanTermQuery(new Term(FIELD_NAME, "bar"));
            SpanTermQuery batTerm = new SpanTermQuery(new Term(FIELD_NAME, "bat"));
            SpanQuery[] fooBar = {fooTerm, barTerm};
            SpanQuery[] fooBat = {fooTerm, batTerm};
            SpanNearQuery nearFooBar = new SpanNearQuery(fooBar, 0, true);
            SpanNearQuery nearFooBat = new SpanNearQuery(fooBat, 0, true);
            SpanOrQuery spanOr = new SpanOrQuery(nearFooBar, nearFooBat);
            Query query = new SpanNotQuery(fooTerm, spanOr);
            // should print docs 0, 3, 6, 9:
            printHits(query);
        }

        private static void printHits(Query query) throws IOException, ParseException {
            System.out.println("------------------------------");
            System.out.println("Parsed query: " + query.toString() + "\n");
            // Close the directory and the reader once the hits are printed.
            try (Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
                    IndexReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs results = searcher.search(query, 100);
                ScoreDoc[] hits = results.scoreDocs;
                if (hits.length == 0) {
                    System.out.println(" No hits found.");
                    System.out.println();
                }
                for (ScoreDoc hit : hits) {
                    System.out.println(" doc id = " + hit.doc);
                    System.out.println(" score = " + hit.score);
                    Document doc = searcher.storedFields().document(hit.doc);
                    System.out.println(" field = " + doc.get(FIELD_NAME));
                    System.out.println();
                }
            }
        }

        // Convenience overload for query strings handled by the classic QueryParser.
        private static void printHits(String queryString, Analyzer analyzer)
                throws IOException, ParseException {
            QueryParser parser = new QueryParser(FIELD_NAME, analyzer);
            Query query = parser.parse(queryString);
            printHits(query);
        }
    }
Interval Functions
Something similar can be achieved using the Lucene StandardQueryParser, together with interval functions.
Using interval functions, we can create a query string with the following structure:
    fn:notWithin(
      foo
      0
      fn:or(
        "foo bar"
        "foo bat"
      )
    )
The fn:notWithin function “matches intervals of the source that do not appear within the provided number of positions from the intervals of the reference.”
Arguments: fn:notWithin(source positions reference)
source: source sub-interval (term or other function)
positions: an integer number of maximum positions between source and reference
reference: reference sub-interval (term or other function)
This is similar to our previous SpanNotQuery.
The code:
    String queryString = "fn:notWithin(foo 0 fn:or(\"foo bar\" \"foo bat\"))";
    StandardQueryParser parser = new StandardQueryParser(analyzer);
    Query query = parser.parse(queryString, FIELD_NAME);
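For cases like the original "United ..." question, where the list of excluded follow-on words may change, the query string can be assembled rather than typed by hand. The following is a minimal sketch in plain Java: NotWithinQueryString and build are hypothetical names (not part of Lucene), the terms are assumed to contain no quotes or other characters needing escaping, and at least two follow-on words are assumed so that fn:or has more than one argument.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Lucene): builds an fn:notWithin query string
// of the same shape as the hand-written one above.
final class NotWithinQueryString {

    /**
     * @param term       the term to match, e.g. "foo"
     * @param positions  maximum positions between the term and an excluded phrase
     * @param followers  words which must not directly follow the term
     */
    static String build(String term, int positions, List<String> followers) {
        // Each excluded phrase is quoted: "term follower".
        String phrases = followers.stream()
                .map(follower -> "\"" + term + " " + follower + "\"")
                .collect(Collectors.joining(" "));
        return "fn:notWithin(" + term + " " + positions + " fn:or(" + phrases + "))";
    }

    public static void main(String[] args) {
        System.out.println(build("foo", 0, Arrays.asList("bar", "bat")));
        // prints: fn:notWithin(foo 0 fn:or("foo bar" "foo bat"))
    }
}
```

The resulting string can then be passed to parser.parse(queryString, FIELD_NAME) as before. Note that the terms should be written as the analyzer will see them; whether mixed-case input behaves as intended depends on the analyzer in use.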
This query matches the same four documents from our original test data.
One difference from the earlier SpanNotQuery approach: the scoring values are different (I don’t know why).