This post focuses on Lucene indexes and the data they contain, not on searching. Some related information can also be found in this earlier post: Lucene Fields and Term Vectors.
The following notes are based on the official Lucene documentation found in the package summary for org.apache.lucene.index.
However, some of the code on that page appears to be slightly out of date. It is also provided only as small fragments, which cannot be run without additional code. To see updated, runnable code examples, take a look in GitHub: Lucene Indexes. The code there builds a small index and then investigates the resulting indexed data.
Some key terms and definitions:
Term | Definition |
---|---|
Segment | A segment contains a subset of all the documents in an index. Each segment is a complete and searchable index in its own right. The creation of new segments and the merging of smaller segments into fewer, larger segments is managed by Lucene. |
Document | A document is the basic unit representing what gets indexed and searched. A document consists of one or more separate fields. Documents are assigned sequential internal IDs by Lucene. |
Field | A field consists of a field name and a textual value. There are many different predefined types of fields, serving different purposes. Custom fields can also be created. |
Term | A term is the basic unit of search. It is, roughly speaking, a word or token - but can also represent other items such as dates, URLs etc. |
Postings Index | The fundamental data structure used by Lucene for indexing and searching. By providing a field name and a search term, you can efficiently retrieve all of the documents which contain that search term for the requested field. It is described as an “inverted index” because it maps from content to documents - as opposed to a “forward” index which maps from documents to content. |
Posting | A posting is an entry in the Lucene inverted index for a given field name and search term. A Lucene posting may optionally contain data in addition to the document ID or IDs containing the term (for example, the frequency and position/offset of the term in the document). |
Stored Field | The opposite of a posting: it is an optional data structure separate from the inverted index, which allows the retrieval of a field’s value given a document ID. |
Term Vector | Term vectors are another separate (and optional) data structure stored in a Lucene index. They can store the same frequency and position/offset data as a posting in the inverted index. However, they organize this data by document, instead of by field (they are more akin to a “forward” index, in that regard). |
The above is a very basic list. For example, DocValues and PointValues are not discussed.
There are many more definitions of key Lucene terms in the Solr Reference Guide.
I don’t find the term “inverted index” to be especially helpful. It’s just an index, designed to support a specific type of query - one suitable for answering questions such as “show me where every occurrence of the word `foobar` can be found”. In that sense, it’s just like the index of terms you find at the back of a reference book: it tells you all of the page numbers in the book where `foobar` is mentioned.
If you are familiar with relational databases, then you are probably more familiar with indexes which support different types of queries: for example, an index on a column containing bank account numbers. In this case, the account number may be your starting point - and the index allows you to efficiently look up the details attached to a given bank account (e.g. the account holder’s name - assuming a very simplified database schema!). It’s a “forward index” (or just an “index”!) because it maps from an identifier (often a unique ID) to that ID’s related content.
The index of terms in a book, by contrast, works in the opposite direction (hence it is “inverted”).
The terms index and the account number index are both indexes: they just serve different purposes.
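To make the two directions concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class name `ForwardVsInverted`, the naive tokenizer and the three sample strings are assumptions made up for illustration. The forward index maps a document ID to its content, and the inverted index maps each term to the IDs of the documents containing it.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative model only; Lucene's real structures are far more compact.
final class ForwardVsInverted {

    // Forward index: document ID -> content (like looking up an account by its number).
    static Map<Integer, String> buildForward(List<String> docs) {
        Map<Integer, String> forward = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            forward.put(docId, docs.get(docId));
        }
        return forward;
    }

    // Inverted index: term -> sorted set of IDs of the documents containing that term.
    static Map<String, SortedSet<Integer>> buildInverted(List<String> docs) {
        Map<String, SortedSet<Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive tokenization: lower-case, then split on anything that is not a letter or digit.
            String[] tokens = docs.get(docId).toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    inverted.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "foobar is a placeholder", "no placeholder here", "foobar again");
        // Forward lookup: start from an ID, arrive at content.
        System.out.println(buildForward(docs).get(1));
        // Inverted lookup: start from content (a term), arrive at IDs.
        System.out.println(buildInverted(docs).get("foobar"));
    }
}
```

Both maps are built from the same documents; only the direction of the lookup differs.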
Three separate index structures are presented below. The first structure navigates from content (terms in fields) to documents. This is the inverted index structure. The other two structures navigate from documents to content.
Postings: field > term > doc > freq/pos/offset
As already mentioned, this index data is the core structure used for full text searches. You search for a specific term in a specific field; and Lucene returns ranked document matches.
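The field > term > doc > freq/pos nesting can be modelled with nested sorted maps. The following is a rough plain-Java sketch, not Lucene code: the class `PostingsSketch`, its method names and the sample text are hypothetical, tokenization is a simple whitespace split, and ranking is left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative model of postings data: field -> term -> docId -> term positions.
final class PostingsSketch {

    private final TreeMap<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> postings =
            new TreeMap<>();

    // Index one field of one document, recording the position of every token.
    void addField(int docId, String field, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue; // only happens for blank input
            }
            postings.computeIfAbsent(field, f -> new TreeMap<>())
                    .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Given a field and a term, return docId -> positions (empty if there is no match).
    SortedMap<Integer, List<Integer>> lookup(String field, String term) {
        TreeMap<String, TreeMap<Integer, List<Integer>>> terms = postings.get(field);
        if (terms == null || !terms.containsKey(term)) {
            return Collections.emptySortedMap();
        }
        return Collections.unmodifiableSortedMap(terms.get(term));
    }

    public static void main(String[] args) {
        PostingsSketch index = new PostingsSketch();
        index.addField(0, "body", "the quick brown fox jumps over the lazy dog");
        index.addField(1, "body", "the lazy cat");
        // The term frequency in a document is the number of recorded positions.
        index.lookup("body", "the").forEach((doc, positions) ->
                System.out.println("doc=" + doc + " freq=" + positions.size() + " pos=" + positions));
    }
}
```

Keeping the term level sorted is a deliberate choice in this model, since it allows terms to be walked in order or located by seeking.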
Stored fields: doc > field > term name/value
Stored fields are optional structures. They are useful for storing a unique document identifier, but are not recommended for storing large text strings. When you retrieve a set of matching documents using the inverted index (the postings data), you can then use the ID of each matching document to retrieve its stored field data.
In the postings data, additional (optional) information such as term positions can be used to support proximity searches.
Term Vectors: doc > field > term > freq/pos > start/end offsets
Term vectors store frequency and position/offset data, but are accessed starting with document IDs (not fields and terms). They are useful for secondary processing such as results highlighting, after a set of matching documents has been returned by a search query.
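As a rough illustration of why per-document offsets help with highlighting, the plain-Java sketch below builds a term -> start/end offsets map for a single field of a single document, then uses it to wrap matches in `<b>` tags. Nothing here is Lucene API: the class `TermVectorHighlightSketch`, the `\w+` tokenizer and the sample text are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model of a term vector for one field of one document.
final class TermVectorHighlightSketch {

    // term -> list of {startOffset, endOffset} pairs, in order of occurrence.
    static Map<String, List<int[]>> termVector(String text) {
        Map<String, List<int[]>> vector = new TreeMap<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            vector.computeIfAbsent(m.group().toLowerCase(Locale.ROOT), t -> new ArrayList<>())
                  .add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    // Wrap every occurrence of the term in <b>...</b> using the stored offsets only.
    static String highlight(String text, Map<String, List<int[]>> vector, String term) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : vector.getOrDefault(term, List.of())) {
            out.append(text, cursor, span[0])
               .append("<b>").append(text, span[0], span[1]).append("</b>");
            cursor = span[1];
        }
        return out.append(text.substring(cursor)).toString();
    }

    public static void main(String[] args) {
        // doc -> term vector of its (single, hypothetical) body field.
        Map<Integer, String> bodies = Map.of(0, "Foobar meets foobar", 1, "No match here");
        Map<Integer, Map<String, List<int[]>>> byDoc = new TreeMap<>();
        bodies.forEach((doc, text) -> byDoc.put(doc, termVector(text)));

        // After a search has identified doc 0, highlight the term without re-tokenizing.
        System.out.println(highlight(bodies.get(0), byDoc.get(0), "foobar"));
    }
}
```

The lookup starts from a document ID, which is the access pattern that distinguishes term vectors from postings.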
The accompanying code for this article creates two basic documents. Each document contains two fields - a `title` field and a `body` field.
The `body` field is a custom field which specifies that a stored field should be created, and also that term vector data should be created. The field also stores additional information in the core posting index data.
This is not necessarily realistic, as it involves some duplicated information - but is useful for the demo code.
You can see the actual index data (the postings in the inverted index, the stored doc entries, and the term vector entries) because the indexer uses the `SimpleTextCodec` - NOT SUITABLE FOR PRODUCTION!
Just browse to the `index` directory on your file system and open the relevant index file in a text editor (such as Notepad++).
The GitHub code shows how you can access the indexed data once it has been created, to inspect the data programmatically.
Postings Data
For postings data, we start with an `IndexReader` and iterate through each index segment using a `LeafReader`.
The `FieldInfos` class gives us access to each indexed field. From there we can access the terms in each field using the `TermsEnum` class.
Finally, we access each posting in the index using `PostingsEnum`, which also allows us to access each relevant document and the specific posting data.
Data Seeker
For an additional example, you can also see how to use `seekExact()` to find specific postings in the inverted index structure.
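The idea behind an exact seek can be imitated with a sorted map. This is only an analogy in plain Java and not the Lucene API: the class `TermSeekSketch`, its methods and the sample terms are hypothetical. An exact seek succeeds only when the term is present, whereas a ceiling lookup lands on the next term in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Illustrative model of seeking within the sorted terms of one field.
final class TermSeekSketch {

    // term -> IDs of the documents containing it, with terms kept in sorted order.
    private final TreeMap<String, List<Integer>> terms = new TreeMap<>();

    void add(String term, int docId) {
        terms.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    // Exact seek: postings are returned only if the term itself is in the index.
    Optional<List<Integer>> seekExact(String term) {
        return Optional.ofNullable(terms.get(term));
    }

    // Ceiling lookup: the smallest indexed term that is >= the target, if any.
    Optional<String> ceiling(String target) {
        return Optional.ofNullable(terms.ceilingKey(target));
    }

    public static void main(String[] args) {
        TermSeekSketch field = new TermSeekSketch();
        field.add("apple", 0);
        field.add("banana", 1);
        field.add("cherry", 0);
        System.out.println(field.seekExact("banana"));    // term present
        System.out.println(field.seekExact("blueberry")); // term absent
        System.out.println(field.ceiling("blueberry"));   // next term in sorted order
    }
}
```

Because the terms are sorted, locating one term does not require scanning all of them, which is the point of seeking rather than iterating.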
Other Structures
The stored fields code and the term vectors code are more straightforward.
Normally, you don’t need to use these techniques to access Lucene index structures. Instead, you just perform your searches and access the resulting documents, using Lucene queries (not shown in this code).
You can also use Luke (see the bottom of this page) to explore any compatible Lucene index.
However, these techniques are useful to gain a better understanding of how data is structured, and how Lucene itself accesses index data, behind the scenes, when executing queries.
The examples shown here are, of course, very basic. Lucene often provides more than one way to perform such tasks, and it can use its indexed data in many more sophisticated ways.