Introduction

My demo web application, which I describe here (with sources here), displays a core set of IMDb data:

But the application has a fatal flaw (maybe it has several, but I’m going to focus on one):

It doesn’t scale.

The IMDb data set I am using contains over 6 million title records (movies, TV episodes, etc).  But my demo only handles a modest 5,000 of these, as highlighted above.  Similarly, my IMDb data set contains over 9.6 million people (actors, directors, producers, etc.), and almost 36 million records to describe which people appear in which titles.

Within my web app, the 5,000 titles it does contain are all displayed in a single HTML table (albeit with client-side paging, courtesy of DataTables). But when that table is created, all 5,000 titles are sent from the server.

This is OK, given the purposes of my web app - to explore various technologies such as Javalin and Thymeleaf.  But it’s not going to be practical in the real world, except for modestly sized data sets.

One solution already hinted at would be to introduce server-side paging: Fetch as much data as a user can see at one time, plus perhaps a little more.  And then fetch the next chunk of data only if needed.

That may be necessary - but it’s probably not sufficient.  Which brings us to the need for text searching.

Topics I will be covering in this post:

  • My Lucene search utility
  • Older Lucene versions and tutorials
  • Introduction to Lucene 8
  • My data sample
  • Building indexes using analyzers
  • Luke - the Lucene analysis tool
  • Using my indexes
  • How to run my sample code
  • Faster performance?
  • Typeahead and debounce
  • Highlighting search terms in results
  • Term vectors (I’m still not entirely sure when it’s best to use these)
  • The custom analyzer
  • Lucene alternatives

My Lucene Search Utility

I’ve never used Lucene’s API before - so I wanted to give it a try. Spoiler alert: I ended up with a small stand-alone application which uses typeahead to search all 6 million titles in my database:

Now we know how it ends, let’s look at the rest of the story.

Older Lucene Versions & Tutorials

I am using the most up-to-date release - which at the time of writing is 8.3.0 (released November 2019).

I ran into issues with some of the existing tutorials, because the most prominent ones use older versions of Lucene, which no longer compile against 8.3.0.

Examples:

Baeldung - uses 7.1.0 (released October 2017)

LuceneTutorial.com - uses 4.0.0 (released October 2012)

Tutorialspoint - uses v3.6.2 (released December 2012)

Lucene versions 7.5.0 and 8.0.0 saw a reasonably large number of API changes, including some breaking changes in core classes (which - to be clear - were documented, and preceded by deprecation warnings in previous releases).

Examples of breaking changes in version 8 include:

  • LUCENE-8356: StandardFilter and StandardFilterFactory were removed.

  • LUCENE-8373: StandardAnalyzer.ENGLISH_STOP_WORD_SET was removed.

  • The default (no-arg) constructor for StopAnalyzer was also removed.

  • StandardAnalyzer.STOP_WORDS_SET was removed.
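
To make the last two concrete, the typical fix looks something like this (a minimal before/after sketch - the English stop word set now lives on EnglishAnalyzer):

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordsMigration {
    public static void main(String[] args) {
        // Pre-8.0 code - no longer compiles, the constant was removed:
        // StandardAnalyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);

        // Lucene 8.x equivalent - use the constant from EnglishAnalyzer instead:
        StandardAnalyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        System.out.println("Stop words in use: " + EnglishAnalyzer.ENGLISH_STOP_WORDS_SET.size());
        analyzer.close();
    }
}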

I also looked at Hibernate, which has a Lucene integration module (Hibernate Search).  But the current version of that (5.11) uses Lucene 5.5.5, under the covers.  And anyway, I wanted to use Lucene directly, to start with.

(I note that Hibernate Search 6 - currently in alpha - will include an upgrade of Lucene to version 8.2).

Lucene Resources

Other resources which helped me:

The Solr web site - a great place to look for definitions and terminology relating to Lucene (since Solr is built on top of Lucene).  For example:

The ElasticSearch Reference web site can also help with terminology, for example:

I also looked at these sites - and although some of the code may be for older versions of Lucene, they were very helpful:

Introduction to Lucene 8

The best tutorial/resource I found for version 8 code was Lucene itself (surprise!). I downloaded the source code for the latest release from the “source release” link on this page:

https://lucene.apache.org/core/downloads.html

I then unzipped and untarred the file (lucene-8.3.0-src.tgz), and navigated to the demo directory:

…/lucene-8.3.0/demo/src/java/org/apache/lucene/demo

There, I took a look at IndexFiles.java, and SearchFiles.java, while reading the code overview provided here:

Overview Description

There’s lots more to explore on the Lucene web site - but this gave me a start.

And, of course, those other (older) resources are full of helpful training materials - just don’t expect their code to compile against the latest release.  No doubt, someday my code will also fail to compile against a future Lucene release.  So it goes.

My Data Sample

I already have a MySQL database containing the full set of IMDb data - see here for more info. For my indexing needs in this Lucene demo, I keep things very simple.  I use a SQL query to assemble all the data I want to index: the title, the director, actors, content type, and year. I concatenate it all into a single field, for simplicity.  This is tokenized, then placed in a Lucene document, along with the related title ID.  There’s not much more to it than that.
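
To make that concrete, the heart of the indexing step looks roughly like this (a sketch rather than my exact code - the SQL and the table/column names are illustrative; the title_id and title_data field names are the ones you will see again in Luke, later on):

import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class TitleIndexer {

    // Illustrative SQL - it just needs to yield one row per title, containing
    // the title ID and the concatenated text to be indexed:
    private static final String SQL =
            "select title_id, "
            + "concat_ws(' ', title, directors, actors, content_type, start_year) as title_data "
            + "from titles_denormalized";

    public static void buildIndex(Analyzer analyzer, Connection conn, String indexDir) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(analyzer);  // an analyzer, as described in the next section
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)), config);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(SQL)) {
            while (rs.next()) {
                Document doc = new Document();
                doc.add(new Field("title_id", rs.getString("title_id"), TextField.TYPE_STORED));
                doc.add(new Field("title_data", rs.getString("title_data"), TextField.TYPE_NOT_STORED));
                writer.addDocument(doc);  // the analyzer tokenizes title_data as the document is added
            }
        }
    }
}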

In the next section I will look at how I split up this data into tokens, as the basis for building full-text indexes.

All of the code is available on GitHub here.

Building the Indexes Using Analyzers

I built two different indexes, using two indexing strategies.  This was just to help me understand how the strategies affect the usefulness of each index for my searches.

This is a large topic - and one where I have barely scratched the surface. There is a daunting array of options for creating indexes, using different analyzers and filters.

To begin with, let’s look at the StandardAnalyzer.  This may well be all you need.

StandardAnalyzer standardAnalyzer = new StandardAnalyzer();

This automatically uses:

  1. StandardTokenizer, which splits text into words based on Unicode text segmentation rules (e.g. splitting on white spaces, and removing punctuation in the process).
  2. LowerCaseFilter, which does what the name suggests.
  3. StopFilter, which removes stopwords.

For point (3): By default, the stopword list is empty.  You can provide a list yourself, or use a predefined list from Lucene, or some combination of the above.  For example:

StandardAnalyzer standardAnalyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

I have further examples later on.
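
As an illustration of the “combination” option, something like this should work - copying Lucene’s predefined English list and adding a few words of your own (the extra words here are only examples):

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Copy the predefined English stop words, then add some custom ones (examples only):
CharArraySet stopWords = CharArraySet.copy(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
stopWords.add("tv");
stopWords.add("episode");
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(stopWords);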

Here is a StandardAnalyzer example taken directly from the official Unicode text segmentation document:

Input text: The quick ("brown") fox can't jump 32.3 feet, right?

After processing by the Lucene standard analyzer, we have the following tokens available to be indexed:

the, quick, brown, fox, can't, jump, 32.3, feet, right

Here are the two analyzers I created.  Each analyzer represents a set of transformations to be applied to my raw data, before the data is indexed:

My SimpleTokenAnalyzer:

@Override  
protected StopwordAnalyzerBase.TokenStreamComponents createComponents(String fieldName) {  
    final Tokenizer source = new StandardTokenizer();  
    TokenStream tokenStream = source;  
    tokenStream = new LowerCaseFilter(tokenStream);  
    tokenStream = new StopFilter(tokenStream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);  
    tokenStream = new ASCIIFoldingFilter(tokenStream);  
    return new StopwordAnalyzerBase.TokenStreamComponents(source, tokenStream);  
}

In the above analyzer (full code here), my input data is first split into tokens by the StandardTokenizer (as described above).  Each token is then converted to lower case, stop-words are removed, and finally accents are removed (along with other ASCII folding mappings).  The resulting tokens are indexed.

Here is the second:

My Ngram35Analyzer:

@Override  
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {  
    Tokenizer source = new StandardTokenizer();  
    TokenStream tokenStream = new LowerCaseFilter(source);  
    tokenStream = new ASCIIFoldingFilter(tokenStream);  
    tokenStream = new NGramTokenFilter(tokenStream, 3, 5, true);  
    return new Analyzer.TokenStreamComponents(source, tokenStream);  
}

The above analyzer (full code here) excludes stopword removal, but includes an extra step: It splits the output into ngram tokens, between 3 and 5 characters in length.  For example, the word hello is split into the following tokens for indexing: hel, ell, llo, hell, ello, hello.

One advantage of using ngrams is that we can now search using partial words - without needing a wildcard at the beginning of our search term (which would defeat the purpose of having an index in the first place).  This means that the index would allow us to enter father as our search term, and would find titles containing godfather, grandfather, fatherhood and so on.

Of course, this means we are indexing more data than in our first example (where hello was simply indexed as hello).
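
If you want to see exactly which tokens an analyzer produces for a given piece of text (handy for comparing the two analyzers above), you can feed it a string and walk the resulting token stream.  A small sketch:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenPrinter {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new Ngram35Analyzer();
             TokenStream stream = analyzer.tokenStream("title_data", "Hello World")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                           // mandatory before incrementToken()
            while (stream.incrementToken()) {
                System.out.println(term.toString());  // e.g. hel, ell, llo, hell, ello, hello, ...
            }
            stream.end();
        }
    }
}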

See later in this post for step-by-step instructions on using my code to build the above two indexes.

Luke - The Lucene Analysis Tool

In both cases, I created my index data in a filesystem directory (see the IndexBuilder class).  One advantage of this is that I can point Luke at this directory and explore the index data.

Luke is an analysis tool for Lucene.  It’s available as a JAR file (lucene-luke-8.x.x.jar), which can be found in the main Lucene binary release package (downloadable from this page).

Run the luke.bat or luke.sh script provided with the JAR to launch Luke. Then point Luke at your index directory:

This is a great way to explore the data.

In the above screenshot, you can see that there are two fields in each of my indexed documents:

  • title_id
  • title_data

Again, my implementation is very basic:  When creating the indexes, I concatenated together all of the data I wanted to index (title, actors, etc). That data was tokenized, and stored in the title_data field in my index.  Note that the original data is not stored - just the data after being converted to tokens by my analyzer (you can see that in the code here).

I also store the title ID in each indexed document - because that is the primary key value which will allow me to join my index results back to the rest of my relational data. More on that below.

Using the Indexes

When I enter a search term in my web page, a two-step process is executed:

  1. Lucene inspects my index data, finds matches for my search term, and ranks them from most relevant downwards (it provides a matching score for this).  Lucene returns a set of results, which includes the title_id I stored alongside my indexed data.

  2. My code takes the title_id from each Lucene result and executes a SQL query to retrieve the relational data for the top 100 matched results. It uses a simple prepared statement to do this, of the form:

select...   
from...   
where title_id in ('id_1', 'id_2', ... 'id_n')  

Those results are then presented back to the user.

This is far from the only way to use Lucene search results - in fact it’s probably the most basic.  For example, with a more structured Lucene index document (one which doesn’t concatenate all my indexable data into only one field), it could be possible to avoid step 2 altogether.
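
To make step 1 a little more concrete, the Lucene side of a lookup can be sketched roughly as follows (not my exact code - it assumes the lucene-queryparser module is on the classpath, and the index directory path is a placeholder):

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TitleSearcher {

    // Returns the title IDs of the top 100 matches, best first:
    public static List<String> search(String userInput, Analyzer analyzer) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index/dir")))) {  // same directory the index was built in
            IndexSearcher searcher = new IndexSearcher(reader);
            // Parse the user's input against the single "title_data" field:
            Query query = new QueryParser("title_data", analyzer).parse(userInput);
            TopDocs topDocs = searcher.search(query, 100);

            List<String> titleIds = new ArrayList<>();
            for (ScoreDoc hit : topDocs.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                titleIds.add(doc.get("title_id"));  // stored field - used to join back to the relational data
            }
            return titleIds;
        }
    }
}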

Using My Sample Code

My sample code consists of a web server (based on Javalin) and a client web page where searches are initiated and results are displayed (provided by a stand-alone HTML text file).

Search results are sent from the server to the client via JSON - Javalin has built-in support for this which makes the process straightforward.
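
As an illustration of how small the server side of that can be, here is a sketch of a Javalin search endpoint (not my actual handler - it assumes Javalin 3 with a JSON mapper such as Jackson on the classpath, and the route and helper names are placeholders):

import io.javalin.Javalin;
import java.util.List;

public class SearchServer {

    public static void main(String[] args) {
        Javalin app = Javalin.create().start(7000);

        // e.g. GET /search?term=godfather  ->  JSON array of matching titles
        app.get("/search", ctx -> {
            String term = ctx.queryParam("term");
            ctx.json(lookupTitles(term));  // Javalin serializes the result to JSON
        });
    }

    // Placeholder for the real Lucene + SQL lookup described above:
    private static List<String> lookupTitles(String term) {
        return List.of("stub result for: " + term);
    }
}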

Before building and running the server, you will need to make two changes to the code:

  1. Provide the location of a new empty directory, here and here (sorry, same thing in 2 places) - this is where Lucene will store and access its indexing data.

  2. Provide your database’s credentials for access to the IMDb data (assuming you have already set that up).  My code assumes you are using a MySQL database.

Build and run the web server using the usual Maven commands (not discussed here).

To build the SimpleTokenAnalyzer index, open a web browser and go to the following URL:

http://localhost:7000/build_index/simple

Progress will be shown in the output terminal of the server process.  The process takes approximately one hour on my PC.  At the end, you should see something like this:

17:20:35.519 [INFO ] - Starting - simple index will be built...  
...  
18:14:18.451 [INFO ] - Indexed 6,030,000 documents  
18:14:18.451 [INFO ] - Approx 0 minutes remaining  
18:14:22.964 [INFO ] - Indexed 6,040,000 documents  
18:14:22.964 [INFO ] - Approx 0 minutes remaining  
18:14:27.088 [INFO ] - Indexed 6,048,014 documents in 54 minutes  
18:14:27.088 [INFO ] - at a rate of approx. 1,875 documents per second.  
18:14:27.088 [INFO ] - Finished.  

The final index consists of 338 MB of data.

To build the Ngram35Analyzer index, use this URL:

http://localhost:7000/build_index/ngram

The output will again take about an hour:

18:16:55.495 [INFO ] - Starting - ngram index will be built...  
...  
19:17:19.462 [INFO ] - Indexed 6,030,000 documents  
19:17:19.462 [INFO ] - Approx 0 minutes remaining  
19:17:24.383 [INFO ] - Indexed 6,040,000 documents  
19:17:24.383 [INFO ] - Approx 0 minutes remaining  
19:17:29.195 [INFO ] - Indexed 6,048,014 documents in 60 minutes  
19:17:29.195 [INFO ] - at a rate of approx. 1,667 documents per second.  
19:17:29.195 [INFO ] - Finished.

This index is much larger - 1.83 GB of data (a consequence of my ngram configuration choices).

Faster Performance

Lucene is capable of much faster indexing speeds than those shown above.  My index creation process is basic - one document at a time - which is about the slowest approach of all.
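
One straightforward improvement (which I have not tried to benchmark here) would be to share a single IndexWriter across several worker threads - IndexWriter is documented as thread-safe, so documents can be added concurrently.  A rough sketch:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ConcurrentIndexer {

    // 'batches' stands in for chunks of rows already read from the database:
    // row[0] = title_id, row[1] = title_data
    public static void indexConcurrently(IndexWriter writer, List<List<String[]>> batches)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (List<String[]> batch : batches) {
            pool.submit(() -> {
                try {
                    for (String[] row : batch) {
                        Document doc = new Document();
                        doc.add(new Field("title_id", row[0], TextField.TYPE_STORED));
                        doc.add(new Field("title_data", row[1], TextField.TYPE_NOT_STORED));
                        writer.addDocument(doc);  // safe to call from multiple threads
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}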

The Lucene source code package (where the basic demo code is provided) includes examples of how to optimize Lucene performance.

For example, Lucene can index multiple documents placed in one file.  See the WriteLineDocTask.java example here:

.../lucene-8.3.0/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.java

As with Analyzers and Filters, this is a very large topic - and outside the scope of this simple post.

Typeahead and Debounce

The client web page I am using in this demo uses a typeahead technique: As the user types a search term into the input field, that data is sent to the server via an AJAX call, and results are returned while the user is still typing.

This feature is sometimes seen in autocomplete situations, such as drop-down lists: the user has a drop-down list of country names, and the list shrinks as they type more text.

Because the user’s search data is sent to the server via a DOM keyup event, it is possible for multiple events to be generated in rapid succession.  The debounce library…

$("#input_1").keyup($.debounce(300, callTitleApi));

…is used to ensure that a request is only sent to the server once the user has paused typing for 300 milliseconds.  300 is a good value for my environment - it may not be suitable in other environments.

(There is also this typeahead.js library - which I have never used. And probably others, too.)

Just to be clear - there is no need to use typeahead with Lucene. They are two separate concepts.

Highlighting Search Terms in Results

One thing not included in my lookup utility is Lucene’s ability to highlight search terms in search results.  This can be useful in helping a user to understand why a record was returned by a search - especially if the data being returned is a large document rather than a small database record.

The following sample shows the core of how Lucene handles highlighting.  It assumes that the data being highlighted is included as part of the index (the simplest case):

import java.io.IOException;  
import java.util.Properties;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.analysis.TokenStream;  
import org.apache.lucene.analysis.Analyzer;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.TopDocs;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.ScoreDoc;  
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;  
import org.apache.lucene.search.highlight.Highlighter;  
import org.apache.lucene.search.highlight.QueryScorer;  
import org.apache.lucene.search.highlight.TokenSources;  
import org.apache.lucene.search.highlight.TextFragment;  
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;  

public class CustomHighlighter {  

    private static final String PRE_TAG = "<span class=\"hilite\">";  
    private static final String POST_TAG = "</span>";  

    public static String highlight(Query query, TopDocs results,  
            IndexSearcher searcher, Analyzer analyzer, ScoreDoc hit,  
            Properties props)  
            throws IOException, InvalidTokenOffsetsException {  
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter(PRE_TAG, POST_TAG);  
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));  
        int id = hit.doc;  
        Document doc = searcher.doc(id);  

        String text = doc.get("your field name here");  
        String highlightedText = null;  

        // let's highlight that text:  
        TokenStream tokenStream = TokenSources.getTokenStream("your field name here",  
                searcher.getIndexReader().getTermVectors(id), text, analyzer, -1);  
        TextFragment[] frags = highlighter.getBestTextFragments(tokenStream, text, false, 10);  

        for (TextFragment frag : frags) {  
            if ((frag != null) && (frag.getScore() > 0)) {  
                highlightedText = frag.toString();  
            }  
        }  
        return  highlightedText;  
    }  

}

The above code also assumes that text is being highlighted for display in HTML - hence the use of a span and a class to control the appearance of highlighting.

Highlighting for simple text searches is straightforward - and you may even consider applying it outside of Lucene altogether.  But for more complicated searches and complex analyzers (e.g. overlapping search terms, wildcards, regular expressions, and other customizations), it can become challenging.
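
For the simplest case - one plain search term, highlighted entirely outside of Lucene - a few lines of ordinary Java are enough.  A sketch using a case-insensitive regex (it ignores all of the complications listed above, including HTML escaping):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleHighlighter {

    // Wrap every occurrence of 'term' in the same span tag used in the Lucene example above:
    public static String highlight(String text, String term) {
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        return matcher.replaceAll("<span class=\"hilite\">$0</span>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("The Godfather Part II", "godfather"));
        // -> The <span class="hilite">Godfather</span> Part II
    }
}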

Also, if your raw data is not stored alongside your indexed data in a Lucene index, then in my above example this line:

String text = doc.get("your field name here");

will return a null value.  In this case, Lucene will require additional data (term vector information) in order to find the correct location of the data to be highlighted.

That is outside the scope of this article - because I am still figuring out what term vectors are and how and when they should (and should not) be used.

Term Vectors

When you create a document for indexing, you must give each field in the document a name, and you must indicate what type of data it holds, from a set of Lucene data types (text, numeric, date, etc).

So, in my lookup utility, I have a simple document defined as follows:

doc.add(new Field("title_id", title_id, TextField.TYPE_STORED));  
doc.add(new Field("title_data", title_data, TextField.TYPE_NOT_STORED));

However, you can also choose to create document fields with additional term vector attributes:

FieldType tvType = new FieldType();  
// Indexes documents, frequencies, positions and offsets:  
tvType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);  
// field's value is NOT stored in the index:  
tvType.setStored(false);  
// store index data in term vectors:  
tvType.setStoreTermVectors(true);   
// store token character offsets in term vectors:  
tvType.setStoreTermVectorOffsets(true);   
// store token positions in term vectors:  
tvType.setStoreTermVectorPositions(true);  
// do NOT store token payloads into the term vector:  
tvType.setStoreTermVectorPayloads(false);  

doc.add(new Field("foo", foo, tvType));

What is this data?  Luke can show us:

As we can see, it is a set of frequencies, positions, and offsets for each indexed term.

In my earlier CustomHighlighter example, I could have accessed term vector data as follows (assuming term vector data had been specified when the index document was defined, of course):

Fields fields = searcher.getIndexReader().getTermVectors(id);
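
Once you have that Fields object, you can walk the terms, frequencies, positions and offsets yourself.  A rough sketch (it assumes the field was indexed with the term vector options shown above):

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.util.BytesRef;

public class TermVectorDumper {

    public static void dump(IndexSearcher searcher, int docId, String fieldName) throws Exception {
        Fields fields = searcher.getIndexReader().getTermVectors(docId);
        Terms terms = fields.terms(fieldName);
        if (terms == null) {
            return;  // no term vectors stored for this field
        }
        TermsEnum termsEnum = terms.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            PostingsEnum postings = termsEnum.postings(null, PostingsEnum.ALL);
            postings.nextDoc();  // a term vector behaves like a single-document index
            int freq = postings.freq();
            System.out.print(term.utf8ToString() + " (freq=" + freq + "):");
            for (int i = 0; i < freq; i++) {
                int position = postings.nextPosition();
                System.out.print(" pos=" + position
                        + " offsets=[" + postings.startOffset() + "," + postings.endOffset() + "]");
            }
            System.out.println();
        }
    }
}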

The Custom Analyzer

In my demo lookup service, I used two analyzers: SimpleTokenAnalyzer and Ngram35Analyzer.  I used class references to assemble the various analyzers, tokenizers and token filters I wanted to use.

But there is an alternative approach which uses the CustomAnalyzer class, together with the builder pattern, to provide a more expressive way to define analyzers.  An example:

Analyzer analyzer = CustomAnalyzer.builder()  
        .withTokenizer("icu")  
        .addTokenFilter("lowercase")  
        .addTokenFilter("porterstem")  
        .build();

Valid names (“lowercase” etc.) are available from the NAME field of the related factory classes (for example, LowerCaseFilterFactory).

If you want a cheat sheet of all names, then one can be generated using the following code:

import org.apache.lucene.analysis.util.CharFilterFactory;
import org.apache.lucene.analysis.util.TokenFilterFactory;
import org.apache.lucene.analysis.util.TokenizerFactory;

public class FactoryNamesLister {

    public static void main(String[] args) {

        System.out.println("Available Tokenizers:");
        TokenizerFactory.availableTokenizers().forEach((item) -> {
            String className = TokenizerFactory.lookupClass(item).getCanonicalName();
            System.out.println(item + "\t" + className);
        });

        System.out.println("Available Char Filters:");
        CharFilterFactory.availableCharFilters().forEach((item) -> {
            String className = CharFilterFactory.lookupClass(item).getCanonicalName();
            System.out.println(item + "\t" + className);
        });

        System.out.println("Available Token Filters:");
        TokenFilterFactory.availableTokenFilters().forEach((item) -> {
            String className = TokenFilterFactory.lookupClass(item).getCanonicalName();
            System.out.println(item + "\t" + className);
        });

    }
}

And the output (pasted into Excel) looks like this:

Lucene Alternatives

My Lucene indexes allow me to provide full-text search functionality across all 6 million of my IMDb titles - and the demo even supports “immediate feedback” searches through its use of typeahead.

My demo is limited, of course.  There is no logic to maintain the index if the underlying data changes.  But that was not the objective of this modest demo.

There are alternatives to Lucene.

As has already been mentioned, Lucene is available via Hibernate. This may be a good choice if you already use Hibernate.

Lucene is used in Solr. Whereas Lucene is essentially a Java library (albeit a very large and sophisticated library), Solr is a fully-fledged application sitting on top of Lucene.  As such, it can - among other things - help with the management of your Lucene infrastructure.  The more I look at Lucene, the more that appears to be a potentially complex undertaking.

Elasticsearch - beyond knowing that it exists, and is also based on Lucene, I don’t know enough about it to comment.  Worth mentioning: Kibana is a data visualization tool for Elasticsearch.  I’ve used it briefly and it looked impressive.

You may also get what you need from using text search capabilities built into your RDBMS.  MySQL is one example, although the search options and syntax are relatively limited compared to Lucene.
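
For comparison, a minimal MySQL full-text lookup from Java might look something like this (a sketch - it assumes a FULLTEXT index already exists, and the table name, column names and credentials are all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlFullTextSearch {

    public static void main(String[] args) throws Exception {
        // Table/column names and credentials are placeholders:
        String sql = "select title_id, title "
                + "from titles "
                + "where match(title) against (? in natural language mode) "
                + "limit 100";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/imdb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "godfather");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("title_id") + " - " + rs.getString("title"));
                }
            }
        }
    }
}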