Lucene Fields and Term Vectors
Introduction
When defining fields and building indexes for Lucene documents, there are several configuration options which affect what data is indexed - and therefore how the index can be used. It’s not immediately obvious (at least it wasn’t to me) what the differences are between these options.
For example, when declaring a field such as a TextField, you can declare it as stored or not stored:
Field.Store.YES
Field.Store.NO
The TextField class also has two static fields:
public static final FieldType TYPE_STORED
public static final FieldType TYPE_NOT_STORED
When creating a field directly from the base Field class, you can define the following index options:
IndexOptions.NONE
IndexOptions.DOCS
IndexOptions.DOCS_AND_FREQS
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
And also the following field type settings:
FieldType::setStored(true)
FieldType::setStoreTermVectors(true)
FieldType::setStoreTermVectorPayloads(true)
FieldType::setStoreTermVectorPositions(true)
FieldType::setStoreTermVectorOffsets(true)
To help understand what these all mean, we need to take a closer look at how an index is built.
SimpleTextCodec
To investigate Lucene fields and the various settings they can use, we will create an index using a codec which generates human-readable output (no binary data):
org.apache.lucene.codecs.simpletext.SimpleTextCodec
A Lucene codec implements an indexing format.
By default, Lucene 8.6 uses the Lucene86Codec. You can see more details describing the index format of this codec here.
Another related Lucene term is “posting”. This represents the data written to an index for a particular entry (e.g. a particular token). A Lucene index consists of a list of postings, plus other related data.
The simple text codec is available through Maven via the following package:
|
|
This codec should only be used in non-production scenarios, as an educational tool.
Codec Basic Usage
We can use this codec in our index writer as follows:
|
|
My sample data consists of two text files:
|
|
In my case, the index data results in several files - but the one we are interested in here is *.scf - a “compound” file containing various “virtual file” sections, where our human-readable index data is stored.
Fields
Lucene provides a set of pre-configured fields for different purposes. You can see a list here. These are all syntactic sugar around the Field class itself.
We will first look at some of the pre-configured field types.
StringField
StringField: A field that is indexed but not tokenized: the entire String value is indexed as a single token:
|
|
This field is useful when the value being indexed is itself a pointer to the document - for example a file name, a database primary key value, and so on. This means that you can use the data stored in the Lucene index to then retrieve the original document data in a subsequent step.
It is also useful for finding exact matches, where all punctuation, whitespace, etc. are preserved.
Here is an example of what such data might look like in our SimpleTextCodec index:
For the “fields” (*.fld) section of the index, the value is stored in the index (Field.Store.YES):
|
|
The above means that once we have a hit on a document (e.g. via other search criteria against other fields), we can retrieve the document value, based on the document’s ID:
|
|
Specifically, in the above code, hitDoc.get("path") will retrieve the document’s path because that data was stored in the “fields” section of the index.
As well as IndexSearcher.doc(int) as shown above, you can also use IndexReader.document() to return the field and its value.
For the “tokens” (*.pst) section of the index, we see the following, because StringField values are stored as single tokens:
|
|
TextField
TextField: A field that is indexed and tokenized, without term vectors:
|
|
In this example, because we are using Field.Store.NO (which is the default storage type), there is no data added to the “fields” section of the index.
We have indexed two documents:
Document 0: echo charlie delta echo
Document 1: bravo alfa charlie
We see the following in the “tokens” section:
|
|
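As a rough mental model of what the tokens section records for these two documents, here is a small plain-Java sketch. It does not use Lucene at all: it splits each document on whitespace and builds a term > doc > positions map. The class and method names (PostingsSketch, build) are made up for illustration, and a real analyzer may tokenize text differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: term -> (docId -> token positions). Not Lucene's real format.
final class PostingsSketch {

    static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
        // TreeMap keeps the terms (and doc IDs) in sorted order.
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumption: tokens are simply whitespace-separated words.
            String[] tokens = docs.get(docId).trim().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("echo charlie delta echo", "bravo alfa charlie");
        // The frequency of a term in a document is the number of recorded positions.
        build(docs).forEach((term, perDoc) ->
            perDoc.forEach((docId, positions) ->
                System.out.println("term " + term + " doc " + docId
                        + " freq " + positions.size() + " pos " + positions)));
    }
}
```

In this model, the term echo has a frequency of 2 in document 0 (positions 0 and 3), and charlie appears once in each document.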
Custom Fields
As mentioned earlier, StringField and TextField are two examples of pre-built fields provided by Lucene. You can also build your own customized fields:
|
|
Here, IndexOptions can be one of the following:
DOCS
DOCS_AND_FREQS
DOCS_AND_FREQS_AND_POSITIONS
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
NONE
These all control the same index content that we saw above with the TextField example.
|
|
Now we see additional information stored in the index for startOffset and endOffset.
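One way to think about these options is that each level includes everything from the level before it. The sketch below uses a local stand-in enum (Level, a hypothetical name) that mirrors the names of Lucene's IndexOptions constants in increasing order of detail. It is not the Lucene class itself, but as far as I know the same ordering comparison can be applied to the real enum.

```java
// Stand-in for illustration only: mirrors the IndexOptions constant names,
// declared from least to most detailed.
final class IndexOptionsSketch {

    enum Level {
        NONE,
        DOCS,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    // Each check relies on declaration order: a later constant is "at least" an earlier one.
    static boolean hasFreqs(Level level) {
        return level.compareTo(Level.DOCS_AND_FREQS) >= 0;
    }

    static boolean hasPositions(Level level) {
        return level.compareTo(Level.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    }

    static boolean hasOffsets(Level level) {
        return level.compareTo(Level.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + ": freqs=" + hasFreqs(level)
                    + " positions=" + hasPositions(level)
                    + " offsets=" + hasOffsets(level));
        }
    }
}
```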
Why might we choose to capture - or not capture - some of these frequency, position, and offset data values in our index?
There are many possible reasons - but here is a quick example: Lucene provides a PhraseQuery which lets you search for specific sequences of words, such as the phrase “lorem ipsum” - that is to say, the token lorem followed by the token ipsum:
|
|
If our index does not include position data, there is no way for Lucene to know that the position of ipsum does indeed immediately follow the position of lorem in our document.
In fact, Lucene throws a java.lang.IllegalStateException in this case, when we try to execute our phrase query:
field “content” was indexed without position data; cannot run PhraseQuery
This type of query requires position data to be present in the index.
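To make the dependency on positions concrete, here is a minimal plain-Java sketch of an exact two-word phrase check. It is a toy model, not how Lucene implements PhraseQuery, and the names (PhraseMatchSketch, matchesPhrase) are hypothetical. The input is a map of term to token positions for a single document.

```java
import java.util.List;
import java.util.Map;

// Toy model only: checks whether "second" occurs at the position right after "first".
final class PhraseMatchSketch {

    static boolean matchesPhrase(Map<String, List<Integer>> positions,
                                 String first, String second) {
        List<Integer> firstPositions = positions.getOrDefault(first, List.of());
        List<Integer> secondPositions = positions.getOrDefault(second, List.of());
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "lorem ipsum dolor"
        Map<String, List<Integer>> docA =
                Map.of("lorem", List.of(0), "ipsum", List.of(1), "dolor", List.of(2));
        // "ipsum dolor lorem": same terms, different order
        Map<String, List<Integer>> docB =
                Map.of("ipsum", List.of(0), "dolor", List.of(1), "lorem", List.of(2));

        System.out.println(matchesPhrase(docA, "lorem", "ipsum")); // true
        System.out.println(matchesPhrase(docB, "lorem", "ipsum")); // false
    }
}
```

Both documents contain both terms, so an index holding only document IDs and frequencies could not tell them apart. Only the position data distinguishes the phrase match from the non-match.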
Term Vectors
Term vectors are an alternative way to structure indexable data in a Lucene index. By default, this data is not stored - you have to explicitly ask for it to be created.
As we saw above, the core index used by Lucene has the following basic repeating structure:
|
|
I will summarize this tokens (*.pst) section as:
field > term > doc > freq/pos/offset
Term vectors are stored in a different (*.vec) section using the following hierarchy:
doc > field > term > freq/pos > offset
To generate term vectors you must add them to your field type definition:
|
|
This causes frequency data to be stored in the term vector index. An entry will look like this:
|
|
This structure can be summarized as:
doc > field > term > freq
You can save additional data in the term vector index structure with these settings:
|
|
Note that you need to use setStoreTermVectors(true) in order to capture term vector positions and offsets. But you can capture positions without offsets, and offsets without positions.
Assuming we have a document containing the following content:
echo charlie delta echo
…then a sample of the related section from the fully-populated term vector index is:
|
|
As we can see, this does not generate new information, but rather the same information in a different hierarchical structure from the main PST (token) index.
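The "same information, different nesting" idea can be sketched in plain Java. The toy model below (not Lucene code; TermVectorSketch and invert are hypothetical names) takes a term > doc > freq map for a single field and re-nests it as doc > term > freq, which is the shape of the term vector data.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: re-nests term -> doc -> freq as doc -> term -> freq.
final class TermVectorSketch {

    static Map<Integer, Map<String, Integer>> invert(
            Map<String, Map<Integer, Integer>> postings) {
        Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();
        postings.forEach((term, perDoc) ->
            perDoc.forEach((docId, freq) ->
                vectors.computeIfAbsent(docId, d -> new TreeMap<>()).put(term, freq)));
        return vectors;
    }

    public static void main(String[] args) {
        // Frequencies for "echo charlie delta echo" (doc 0) and "bravo alfa charlie" (doc 1).
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        postings.put("alfa", Map.of(1, 1));
        postings.put("bravo", Map.of(1, 1));
        postings.put("charlie", Map.of(0, 1, 1, 1));
        postings.put("delta", Map.of(0, 1));
        postings.put("echo", Map.of(0, 2));

        invert(postings).forEach((docId, terms) ->
                System.out.println("doc " + docId + " -> " + terms));
    }
}
```

No values are added or removed by the re-nesting. The only change is which key you start from, and that is what makes each structure suited to a different kind of lookup.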
Term Vector Payloads
There is one additional setting:
|
|
Payloads are a very different topic, and are not discussed here, so I will ignore this setting for now.
Index Comparison
If we compare the two index structures (token index and term vector index), we can consider situations in which each may be used most effectively by Lucene:
PST: field > term > doc > freq/pos/offset
VEC: doc > field > term > freq/pos > offset
The PST structure is better for answering general queries, where we are typically asking for some value to be present in a specific field (or multiple combinations of fields and values). This is basic searching.
The VEC structure can be used when we have already found a document, and we want to know exactly where the search terms are located in that document - for example for results highlighting.
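As an illustration of why per-document offsets are handy for highlighting, here is a minimal plain-Java sketch that wraps matched spans in marker tags. It is not Lucene's highlighter. The names (HighlightSketch, highlight) are hypothetical, and it assumes the offsets are sorted, non-overlapping start/end pairs where the end offset is exclusive.

```java
// Toy model only: wraps each [start, end) span of the text in <b>...</b>.
final class HighlightSketch {

    static String highlight(String text, int[][] offsets) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : offsets) {
            out.append(text, cursor, span[0])   // unmatched text before the span
               .append("<b>")
               .append(text, span[0], span[1])  // the matched term itself
               .append("</b>");
            cursor = span[1];
        }
        return out.append(text.substring(cursor)).toString();
    }

    public static void main(String[] args) {
        // Offsets of "echo" in the sample document: characters 0-4 and 19-23.
        int[][] echoOffsets = { {0, 4}, {19, 23} };
        System.out.println(highlight("echo charlie delta echo", echoOffsets));
        // prints: <b>echo</b> charlie delta <b>echo</b>
    }
}
```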
Indexed Offsets
An example method which shows how to access term vector offset data is provided in my lucene-term-vectors project on GitHub.
The getIndexedOffsets() method shows an example which uses term vector data written during the indexing process. The code locates the offset statistics for a given term in the specified field of a document - for example, when processing the results of a search.
This uses existing term vector data stored in the Lucene index, as shown here.
The method is as follows:
|
|
The code uses an IndexReader, which is created in the above method, but which could have been accessed from the IndexSearcher object used when we previously performed a search:
|
|
We access term vector data using reader.getTermVector(docID, fieldName). This gives us an iterable object for the given field in the given document. This matches our term vector index structure, discussed above.
We can then locate the specific term we need using:
|
|
And this, in turn, lets us access a postings enumerator:
|
|
The first parameter in the above method is set to null in our case, but can be used to directly provide (and re-use) a previously created PostingsEnum.
The second parameter is a flag representing which optional per-document values you want to access. Possible values are:
PostingsEnum.FREQS
PostingsEnum.POSITIONS
PostingsEnum.OFFSETS
PostingsEnum.PAYLOADS
PostingsEnum.ALL
PostingsEnum.NONE
For our demo, we only access the offsets data.
Dynamic Term Vectors
Storing term vector data during the analysis and indexing phase may result in a significantly larger index than if term vectors were not generated. An alternative is to generate term vector data on-the-fly, as it is needed (e.g. to highlight specific search terms in one specific document).
This may slow down search performance, but can reduce index storage requirements.
My lucene-term-vectors project on GitHub shows two approaches for accessing term vector data, one of which creates offset data on-the-fly, rather than writing it to an index.
Token Stream Offsets
The getDynamicOffsets() method generates offset data on-the-fly from a token stream - which is the same as the token stream used to analyze data during the index-building process.
|
|
In this example, we re-create the token stream from the analyzer we originally used.
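The general idea of computing offsets on-the-fly can be sketched without Lucene. The toy code below stands in for an analyzer by treating every run of non-whitespace characters as a token, and records the start and end character offsets of each token that equals the search term. The names (OffsetSketch, offsetsOf) are hypothetical, and a real analyzer would typically also lowercase, strip punctuation, and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: whitespace "tokenizer" that reports {startOffset, endOffset} pairs.
final class OffsetSketch {

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    static List<int[]> offsetsOf(String text, String term) {
        List<int[]> offsets = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            if (matcher.group().equals(term)) {
                // The end offset is exclusive: one past the last character of the token.
                offsets.add(new int[] { matcher.start(), matcher.end() });
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        for (int[] span : offsetsOf("echo charlie delta echo", "echo")) {
            System.out.println("startOffset " + span[0] + " endOffset " + span[1]);
        }
    }
}
```

Nothing here is read from an index: the offsets are recomputed from the original text each time, which is the storage-versus-speed trade-off described above.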
The Lucene documentation presents a simple analyzer example here, or you can see a more detailed walk-through and code example in the Lucene Analysis package documentation.
Highlighting
Another approach, using Lucene’s highlighter classes, is shown here.
Author northCoder
LastMod 30-Dec-2020