Lucene Fields and Term Vectors

30 Dec 2020

Introduction

When defining fields and building indexes for Lucene documents, there are several configuration options which affect what data is indexed - and therefore how the index can be used. It’s not immediately obvious (at least it wasn’t to me) what the differences are between these options.

For example, when declaring a field such as a TextField you can declare it as stored or not stored:

  • Field.Store.YES
  • Field.Store.NO

The TextField class also has two static fields:

  • public static final FieldType TYPE_STORED
  • public static final FieldType TYPE_NOT_STORED

When creating a field directly from the base Field class, you can define the following index options:

  • IndexOptions.NONE
  • IndexOptions.DOCS
  • IndexOptions.DOCS_AND_FREQS
  • IndexOptions.DOCS_AND_FREQS_AND_POSITIONS
  • IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS

And also the following FieldType settings:

  • FieldType::setStored(true)
  • FieldType::setStoreTermVectors(true)
  • FieldType::setStoreTermVectorPayloads(true)
  • FieldType::setStoreTermVectorPositions(true)
  • FieldType::setStoreTermVectorOffsets(true)

To help understand what these all mean, we need to take a closer look at how an index is built.

SimpleTextCodec

To investigate Lucene fields and the various settings they can use, we will create an index using a codec which generates human-readable output (no binary data):

org.apache.lucene.codecs.simpletext.SimpleTextCodec

A Lucene codec implements an indexing format.

By default Lucene 8.6 uses the Lucene86Codec. You can see more details describing the index format of this codec here.

Another related Lucene term is “posting”. This represents the data written to an index for a particular entry (e.g. a particular token). A Lucene index consists of a list of postings, plus other related data.

The simple text codec is available from Maven via the following dependency:

XML
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-codecs</artifactId>
    <version>8.6.3</version>
</dependency>

This codec should only be used in non-production scenarios, as an educational tool.

Codec Basic Usage

We can use this codec in our index writer as follows:

Java
final String indexPath = "E:/lucene/lucene-863/index";
final String docsPath = "E:/lucene/lucene-863/inputs";
final Path docDir = Paths.get(docsPath);
Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
System.out.println(iwc.getCodec().getName());
try ( IndexWriter writer = new IndexWriter(dir, iwc)) {
    // read documents, and write index data:
    indexDocs(writer, docDir);
}

My sample data consists of two text files:

E:\lucene\lucene-863\inputs\foo.txt:
Bravo! Alfa! Charlie!

E:\lucene\lucene-863\inputs\bar.txt:
Echo, Charlie, Delta - Echo.

In my case, the index data results in several files - but the one we are interested in here is *.scf - a “compound” file containing various “virtual file” sections, where our human-readable index data is stored.

Fields

Lucene provides a set of pre-configured fields for different purposes. You can see a list here. These are all syntactic sugar around the Field class itself.

We will first look at some of the pre-configured field types.

StringField

StringField: A field that is indexed but not tokenized: the entire String value is indexed as a single token:

Java
Field pathField = new StringField("path", file.toString(), Field.Store.YES);

This field is useful when the value being indexed is itself a pointer to the document - for example a file name, a database primary key value, and so on. This means that you can use the data stored in the Lucene index to then retrieve the original document data in a subsequent step.

It is also useful for finding exact matches, where all punctuation, whitespace, etc. are preserved.

Here is an example of what such data might look like in our SimpleTextCodec index:

For the “fields” (*.fld) section of the index, the value is stored in the index (Field.Store.YES):

doc 0
  field 0
    name path
    type string
    value E:\\lucene\\lucene-863\\inputs\\bar.txt
doc 1
  field 0
    name path
    type string
    value E:\\lucene\\lucene-863\\inputs\\foo.txt

The above means that once we have a hit on a document (e.g. via other search criteria against other fields), we can retrieve the document value, based on the document’s ID:

Java
ScoreDoc[] hits = indexSearcher.search(query, 100).scoreDocs;

for (ScoreDoc hit : hits) {
    BigDecimal score = new BigDecimal(String.valueOf(hit.score))
            .setScale(3, RoundingMode.HALF_EVEN);
    Document hitDoc = indexSearcher.doc(hit.doc);
    System.out.println(String.format("%s - %s",
            String.format("%7.3f", score),
            String.format("%-10s", hitDoc.get("path"))));
}

Specifically, in the above code, hitDoc.get("path") will retrieve the document’s path because that data was stored in the “fields” section of the index.

As well as IndexSearcher.doc(int) as shown above, you can also use IndexReader.document() to return the field and its value.

For the “tokens” (*.pst) section of the index, we see the following, because StringField values are stored as single tokens:

field path
  term E:\\lucene\\lucene-863\\inputs\\bar.txt
    doc 0
  term E:\\lucene\\lucene-863\\inputs\\foo.txt
    doc 1

TextField

TextField: A field that is indexed and tokenized, without term vectors:

Java
Field textField1 = new TextField("bodytext1", content, Field.Store.NO);

In this example, because we are using Field.Store.NO (equivalent to the FieldType default of stored = false), no data is added to the “fields” section of the index.

We have indexed two documents:

Document 0: echo charlie delta echo
Document 1: bravo alfa charlie

We see the following in the “tokens” section:

field bodytext1
  term alfa
    doc 1
      freq 1
      pos 1
  term bravo
    doc 1
      freq 1
      pos 0
  term charlie
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 2
  term delta
    doc 0
      freq 1
      pos 2
  term echo
    doc 0
      freq 2
      pos 0
      pos 3
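
To make the shape of this output easier to reason about, here is a minimal sketch in plain Java (no Lucene dependency) that builds the same term > doc > freq/pos hierarchy from the two sample documents. The class name PostingsSketch and its build method are hypothetical, and the tokens are supplied already analyzed; this illustrates the data structure only, and is not how Lucene implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene code: a tiny in-memory postings list.
class PostingsSketch {

    // term -> docId -> positions. TreeMap keeps terms and doc IDs sorted,
    // mirroring the ordering seen in the SimpleTextCodec output.
    static Map<String, Map<Integer, List<Integer>>> build(List<String[]> docs) {
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] tokens = docs.get(docId);
            for (int pos = 0; pos < tokens.length; pos++) {
                postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String[]> docs = List.of(
                new String[] {"echo", "charlie", "delta", "echo"}, // doc 0
                new String[] {"bravo", "alfa", "charlie"});        // doc 1
        build(docs).forEach((term, byDoc) -> {
            System.out.println("term " + term);
            // The frequency is simply the number of recorded positions.
            byDoc.forEach((docId, positions) -> System.out.println(
                    "  doc " + docId + " freq " + positions.size() + " pos " + positions));
        });
    }
}
```

Running main prints one entry per term, with the documents it occurs in listed underneath, which is the same grouping as the “tokens” section above.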

Custom Fields

As mentioned earlier, StringField and TextField are two examples of pre-built fields provided by Lucene. You can also build your own customized fields:

Java
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStored(false);    // default = false (same as Field.Store.NO)
fieldType.setTokenized(true);  // default = true (tokenize the content)
fieldType.setOmitNorms(false); // default = false (used when scoring)
Field contentField = new Field("content", "Lorem ipsum...", fieldType);

Here, IndexOptions can be one of the following:

DOCS
DOCS_AND_FREQS
DOCS_AND_FREQS_AND_POSITIONS
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
NONE

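These options control how much of the index content we saw above in the TextField example is written. With DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, as configured above, the “tokens” section looks like this: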

field content
  term ipsum
    doc 0
      freq 1
      pos 1
      startOffset 6
      endOffset 11
  term lorem
    doc 0
      freq 1
      pos 0
      startOffset 0
      endOffset 5

Now we see additional information stored in the index for startOffset and endOffset.
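
As a rough illustration of how the index options build on each other, the following plain-Java sketch renders a single posting at each level. The Level enum only mirrors the names of Lucene's IndexOptions constants, and the class and method names are hypothetical; it is a simplified model of what each level retains, not Lucene's own code.

```java
// Hypothetical sketch, not Lucene code: what one posting retains at each level.
class IndexOptionsSketch {

    // Mirrors the names of Lucene's IndexOptions constants (NONE is omitted:
    // it means the field is not indexed at all).
    enum Level {
        DOCS,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    // positions[i] and offsets[i] = {start, end} describe the i-th occurrence.
    static String posting(Level level, int docId, int[] positions, int[][] offsets) {
        StringBuilder sb = new StringBuilder("doc " + docId);
        if (level.compareTo(Level.DOCS_AND_FREQS) >= 0) {
            sb.append(" freq ").append(positions.length);
        }
        for (int i = 0; i < positions.length; i++) {
            if (level.compareTo(Level.DOCS_AND_FREQS_AND_POSITIONS) >= 0) {
                sb.append(" pos ").append(positions[i]);
            }
            if (level == Level.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) {
                sb.append(" startOffset ").append(offsets[i][0])
                  .append(" endOffset ").append(offsets[i][1]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ipsum" in "Lorem ipsum...": position 1, offsets 6-11 (from the output above).
        for (Level level : Level.values()) {
            System.out.println(level + ": "
                    + posting(level, 0, new int[] {1}, new int[][] {{6, 11}}));
        }
    }
}
```

Each level adds to the one before it: document IDs only, then frequencies, then positions, then offsets.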

Why might we choose to capture - or not capture - some of these frequency, position, and offset data values in our index?

There are many possible reasons - but here is a quick example: Lucene provides a PhraseQuery which lets you search for specific sequences of words, such as the phrase “lorem ipsum” - that is to say, the token lorem followed by the token ipsum:

Java
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("content", "lorem"), 1);
builder.add(new Term("content", "ipsum"), 2);
PhraseQuery phraseQuery = builder.build();

If our index does not include position data, there is no way for Lucene to know that the position of ipsum does indeed immediately follow the position of lorem in our document.

In fact, Lucene throws a java.lang.IllegalStateException in this case, when we try to execute our phrase query:

field “content” was indexed without position data; cannot run PhraseQuery

This type of query requires this data to be present in the index.
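
To see why positions are essential here, consider this simplified plain-Java sketch of an adjacency check. It is not Lucene's PhraseQuery implementation (the class and method names are hypothetical). It only shows that the question "does one term immediately follow another?" can be answered from per-document position lists, and cannot be answered without them.

```java
import java.util.List;

// Hypothetical sketch, not Lucene's PhraseQuery implementation.
class PhraseMatchSketch {

    // True when some occurrence of the second term sits exactly one
    // position after an occurrence of the first term in the same document.
    static boolean adjacent(List<Integer> firstPositions, List<Integer> secondPositions) {
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Positions taken from the "content" field output: lorem=0, ipsum=1.
        System.out.println(adjacent(List.of(0), List.of(1))); // "lorem ipsum" -> true
        System.out.println(adjacent(List.of(1), List.of(0))); // "ipsum lorem" -> false
    }
}
```

With only document IDs and frequencies, both calls would have the same inputs to work with (each term occurs once in the document), so the two phrases could not be told apart.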

Term Vectors

Term vectors are an alternative way to structure indexable data in a Lucene index. By default, this data is not stored - you have to explicitly ask for it to be created.

As we saw above, the core index used by Lucene has the following basic repeating structure:

field
  term
    doc
      freq
      pos
      startOffset
      endOffset
  term
    doc
      freq
      pos
      startOffset
      endOffset

I will summarize this tokens (*.pst) section as:

field > term > doc > freq/pos/offset

Term vectors are stored in a different (*.vec) section using the following hierarchy:

doc > field > term > freq/pos > offset

To generate term vectors you must add them to your field type definition:

Java
myFieldType.setStoreTermVectors(true);

This causes frequency data to be stored in the term vector index. An entry will look like this:

doc 0
  field 1
    name content2
    term charlie
      freq 1

This structure can be summarized as:

doc > field > term > freq

You can save additional data in the term vector index structure with these settings:

Java
fieldType.setStoreTermVectors(true);         // captures frequencies
fieldType.setStoreTermVectorPositions(true); // requires setStoreTermVectors(true)
fieldType.setStoreTermVectorOffsets(true);   // requires setStoreTermVectors(true)

Note that you need to use setStoreTermVectors(true) in order to capture term vector positions and offsets. But you can capture positions without offsets and offsets without positions.

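Assuming we have a document whose text is analyzed into the following tokens (the original text in bar.txt is “Echo, Charlie, Delta - Echo.”, so the offsets shown below include its punctuation and whitespace):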

echo charlie delta echo

…then a sample of the related section from the fully-populated term vector index is:

doc 0
  numfields 1
  field 1
    name content2
    positions true
    offsets   true
    payloads  false
    numterms 3
    term charlie
      freq 1
      position 1
        startoffset 6
        endoffset 13
    term delta
      freq 1
      position 2
        startoffset 15
        endoffset 20
    term echo
      freq 2
      position 0
        startoffset 0
        endoffset 4
      position 3
        startoffset 23
        endoffset 27

As we can see, this does not generate new information, but rather the same information in a different hierarchical structure from the main PST (token) index.
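
The relationship between the two structures can be sketched in plain Java as a simple regrouping: take the postings (term > doc > positions) and pivot them by document (doc > term > positions). The class TermVectorSketch and its invert method are hypothetical names for illustration. Lucene writes term vectors during indexing rather than deriving them like this, so treat it only as a model of the two hierarchies.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene code: regroup postings by document.
class TermVectorSketch {

    // Input:  term -> docId -> positions   (postings layout)
    // Output: docId -> term -> positions   (term vector layout)
    static Map<Integer, Map<String, List<Integer>>> invert(
            Map<String, Map<Integer, List<Integer>>> postings) {
        Map<Integer, Map<String, List<Integer>>> vectors = new TreeMap<>();
        postings.forEach((term, byDoc) ->
                byDoc.forEach((docId, positions) ->
                        vectors.computeIfAbsent(docId, d -> new TreeMap<>())
                               .put(term, positions)));
        return vectors;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        postings.put("charlie", Map.of(0, List.of(1), 1, List.of(2)));
        postings.put("delta", Map.of(0, List.of(2)));
        postings.put("echo", Map.of(0, List.of(0, 3)));
        // Prints {charlie=[1], delta=[2], echo=[0, 3]}
        System.out.println(invert(postings).get(0));
    }
}
```

The entry for document 0 lists the same terms and positions as the term vector sample above, with nothing added or lost in the regrouping.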

Term Vector Payloads

There is one additional setting:

Java
fieldType.setStoreTermVectorPayloads(true);

Payloads are a very different topic, and are not discussed here. So I will ignore this setting for now.

Index Comparison

If we compare the two index structures (token index and term vector index), we can consider situations in which each may be used most effectively by Lucene:

PST: field > term > doc > freq/pos/offset
VEC: doc > field > term > freq/pos > offset

The PST structure is better for answering general queries, where we are typically asking for some value to be present in a specific field (or multiple combinations of fields and values). This is basic searching.

The VEC structure can be used when we have already found a document, and we want to know exactly where the search terms are located in that document - for example for results highlighting.

Indexed Offsets

An example method which shows how to access term vector offset data is provided in my lucene-term-vectors project on GitHub.

The getIndexedOffsets() method shows an example which uses term vector data written during the indexing process. The code locates the offset statistics for a given term in the specified field of a document - for example, when processing the results of a search.

This uses existing term vector data stored in the Lucene index, as shown here.

The method is as follows:

Java
private void getIndexedOffsets(int docID, String fieldName, String searchTerm) throws IOException {
    // Close both the directory and the reader when done.
    try ( Directory dir = FSDirectory.open(Paths.get(indexPath));
            IndexReader reader = DirectoryReader.open(dir)) {
        // getTermVector returns null if no term vectors were stored for this field.
        Terms terms = reader.getTermVector(docID, fieldName);
        if (terms == null) {
            return;
        }
        TermsEnum termsEnum = terms.iterator();
        BytesRef bytesRef = new BytesRef(searchTerm);
        if (termsEnum.seekExact(bytesRef)) {
            PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);
            pe.nextDoc(); // a term vector covers exactly one document
            int freq = pe.freq();
            for (int i = 0; i < freq; i++) {
                pe.nextPosition();
                // startOffset() is -1 when offsets were not stored.
                if (pe.startOffset() >= 0) {
                    printOffset(pe.startOffset(), pe.endOffset());
                }
            }
        }
    }
}

The code uses an IndexReader, which is created in the above method, but which could have been accessed from the IndexSearcher object used when we previously performed a search:

Java
indexSearcher.getIndexReader();

We access term vector data using reader.getTermVector(docID, fieldName). This gives us an iterable object for the given field in the given document. This matches our term vector index structure, discussed above.

We can then locate the specific term we need using:

Java
termsEnum.seekExact(bytesRef)

And this, in turn, lets us access a postings enumerator:

Java
PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);

The first parameter in the above method is set to null in our case, but can be used to directly provide (and re-use) a previously created PostingsEnum.

The second parameter is a flag representing which optional per-document values you want to access. Possible values are:

  • PostingsEnum.FREQS
  • PostingsEnum.POSITIONS
  • PostingsEnum.OFFSETS
  • PostingsEnum.PAYLOADS
  • PostingsEnum.ALL
  • PostingsEnum.NONE

For our demo, we only access the offsets data.

Dynamic Term Vectors

Storing term vector data during the analysis and indexing phase may result in a significantly larger index than if term vectors were not generated. An alternative is to generate term vector data on-the-fly, only when it is needed (e.g. to highlight specific search terms in one specific document).

This may slow down search performance, but can reduce index storage requirements. It’s a trade-off.

My lucene-term-vectors project on GitHub shows two approaches for accessing term vector data, one of which creates offset data on-the-fly, rather than writing it to an index.

Token Stream Offsets

The getDynamicOffsets() method generates offset data on-the-fly from a token stream - which is the same as the token stream used to analyze data during the index-building process.

Java
private void getDynamicOffsets(String field, String searchTerm) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    // "content" is assumed to be a field of the enclosing class holding the document's original text.
    TokenStream ts = analyzer.tokenStream(field, content);
    CharTermAttribute charTermAttr = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAttr = ts.addAttribute(OffsetAttribute.class);

    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        System.out.println();
        System.out.println("token: " + searchTerm);
        while (ts.incrementToken()) {
            if (searchTerm.equals(charTermAttr.toString())) {
                printOffset(offsetAttr.startOffset(), offsetAttr.endOffset());
            }
        }
        ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }
}

In this example, we re-create the token stream from the analyzer we originally used.
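
For readers who want to experiment with the idea without a Lucene dependency, here is a minimal plain-Java sketch which computes offsets on-the-fly from the original text. It uses a regular expression as a crude stand-in for an analyzer (an assumption for illustration: tokens are runs of letters or digits, lowercased), so it will not match StandardAnalyzer in every case. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a regex stand-in for an analyzer, not StandardAnalyzer.
class DynamicOffsetsSketch {

    // Assumption: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns {startOffset, endOffset} pairs for each occurrence of searchTerm.
    static List<int[]> offsets(String content, String searchTerm) {
        List<int[]> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            // Lowercase each token before comparing, as the analyzer would.
            if (m.group().toLowerCase(Locale.ROOT).equals(searchTerm)) {
                result.add(new int[] {m.start(), m.end()});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] o : offsets("Echo, Charlie, Delta - Echo.", "echo")) {
            System.out.println("startOffset " + o[0] + " endOffset " + o[1]);
        }
    }
}
```

For the sample text, this reports the offsets 0-4 and 23-27 for echo, matching the values stored in the term vector index shown earlier.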

The Lucene documentation presents a simple analyzer example here, or you can see a more detailed walk-through and code example in the Lucene Analysis package documentation.

Highlighting

Another approach using Lucene’s highlighter classes is shown here.