Lucene 8 Custom Analyzer SPI Names

21 Nov 2019


SPI Names in Lucene 8.3.0

The SPI (service provider interface) names listed below can be used to build custom analyzers. For example:

Java
Analyzer analyzer = CustomAnalyzer.builder()  
        .withTokenizer("icu")  
        .addTokenFilter("lowercase")  
        .addTokenFilter("asciiFolding")  
        .build();  

This can be much more succinct than instantiating the corresponding tokenizer and filter classes directly.

The Custom Analyzer

Here is a brief look at the CustomAnalyzer class used in the previous code example.

Some examples of using this class are provided in the JavaDoc.

For example, if you want to add a list of stop words to your analyzer, you can do so in the following ways:

Java
.addTokenFilter(StopFilterFactory.NAME,
                "ignoreCase", "false",
                "words", "stopwords.txt",
                "format", "wordset")

Or by using the stop filter’s SPI name, which is stop:

Java
.addTokenFilter("stop",
                "ignoreCase", "false",
                "words", "stopwords.txt",
                "format", "wordset")

Or by passing in the key/value parameters in a Map:

Java
Map<String, String> stopMap = new HashMap<>();
stopMap.put("words", "stopwords.txt");
stopMap.put("format", "wordset");

And then:

Java
.addTokenFilter("stop", stopMap)
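The varargs form and the Map form carry the same key/value pairs. As a minimal sketch, the helper below turns a flat list of pairs into a map. FilterParams and its of method are hypothetical names used for illustration only; they are not part of Lucene. The helper returns a HashMap because the CustomAnalyzer builder's JavaDoc states that the parameter map must be modifiable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: builds a parameter map
// from alternating key/value strings.
final class FilterParams {

    private FilterParams() {
    }

    static Map<String, String> of(String... keyValuePairs) {
        if (keyValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException(
                    "Expected an even number of arguments (key, value, ...)");
        }
        // A HashMap rather than Map.of(...), so the result stays modifiable.
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < keyValuePairs.length; i += 2) {
            params.put(keyValuePairs[i], keyValuePairs[i + 1]);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> stopMap = FilterParams.of(
                "ignoreCase", "false",
                "words", "stopwords.txt",
                "format", "wordset");
        System.out.println(stopMap.get("words")); // prints "stopwords.txt"
    }
}
```

The resulting map can then be passed to the builder in the same way as stopMap above.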

Those key/value parameters (e.g. ignoreCase, words and format) are defined by the related filter factory. For example, the parameters for the stop filter are documented in the StopFilterFactory JavaDoc.

Don’t forget that, in the case of stop words, the file (e.g. stopwords.txt) is expected to be on the classpath, for example in the default package of your application.
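If the analyzer fails to build because the file cannot be found, a quick classpath check can help narrow the problem down. The following is a minimal sketch using only the JDK. ClasspathCheck is a hypothetical class name, and the sketch assumes your application and Lucene are loaded by the same class loader, as in a plain java -cp setup. It does not reproduce Lucene's own resource loading.

```java
import java.net.URL;

// Minimal sketch: checks whether a resource is visible at the classpath root.
final class ClasspathCheck {

    private ClasspathCheck() {
    }

    // Returns true if the class loader that loaded this class can find the resource.
    // The name is relative to the classpath root and has no leading slash.
    static boolean isOnClasspath(String resourceName) {
        URL location = ClasspathCheck.class.getClassLoader().getResource(resourceName);
        return location != null;
    }

    public static void main(String[] args) {
        String name = "stopwords.txt"; // file name taken from the examples above
        if (isOnClasspath(name)) {
            System.out.println(name + " found on the classpath");
        } else {
            System.out.println(name + " NOT found; place it in a classpath root");
        }
    }
}
```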

Tokenizers

SPI Name Class Name
classic ClassicTokenizerFactory
edgeNGram EdgeNGramTokenizerFactory
icu ICUTokenizerFactory
keyword KeywordTokenizerFactory
letter LetterTokenizerFactory
nGram NGramTokenizerFactory
pathHierarchy PathHierarchyTokenizerFactory
pattern PatternTokenizerFactory
simplePattern SimplePatternTokenizerFactory
simplePatternSplit SimplePatternSplitTokenizerFactory
standard StandardTokenizerFactory
thai ThaiTokenizerFactory
uax29UrlEmail UAX29URLEmailTokenizerFactory
whitespace WhitespaceTokenizerFactory
wikipedia WikipediaTokenizerFactory

Char Filters

SPI Name Class Name
htmlStrip HTMLStripCharFilterFactory
icuNormalizer2 ICUNormalizer2CharFilterFactory
mapping MappingCharFilterFactory
patternReplace PatternReplaceCharFilterFactory
persian PersianCharFilterFactory

Token Filters

SPI Name Class Name
apostrophe ApostropheFilterFactory
arabicNormalization ArabicNormalizationFilterFactory
arabicStem ArabicStemFilterFactory
asciiFolding ASCIIFoldingFilterFactory
bengaliNormalization BengaliNormalizationFilterFactory
bengaliStem BengaliStemFilterFactory
brazilianStem BrazilianStemFilterFactory
bulgarianStem BulgarianStemFilterFactory
capitalization CapitalizationFilterFactory
cjkBigram CJKBigramFilterFactory
cjkWidth CJKWidthFilterFactory
classic ClassicFilterFactory
codepointCount CodepointCountFilterFactory
commonGrams CommonGramsFilterFactory
commonGramsQuery CommonGramsQueryFilterFactory
concatenateGraph ConcatenateGraphFilterFactory
czechStem CzechStemFilterFactory
dateRecognizer DateRecognizerFilterFactory
decimalDigit DecimalDigitFilterFactory
delimitedPayload DelimitedPayloadTokenFilterFactory
delimitedTermFrequency DelimitedTermFrequencyTokenFilterFactory
dictionaryCompoundWord DictionaryCompoundWordTokenFilterFactory
edgeNGram EdgeNGramFilterFactory
elision ElisionFilterFactory
englishMinimalStem EnglishMinimalStemFilterFactory
englishPossessive EnglishPossessiveFilterFactory
fingerprint FingerprintFilterFactory
finnishLightStem FinnishLightStemFilterFactory
fixBrokenOffsets FixBrokenOffsetsFilterFactory
fixedShingle FixedShingleFilterFactory
flattenGraph FlattenGraphFilterFactory
frenchLightStem FrenchLightStemFilterFactory
frenchMinimalStem FrenchMinimalStemFilterFactory
galicianMinimalStem GalicianMinimalStemFilterFactory
galicianStem GalicianStemFilterFactory
germanLightStem GermanLightStemFilterFactory
germanMinimalStem GermanMinimalStemFilterFactory
germanNormalization GermanNormalizationFilterFactory
germanStem GermanStemFilterFactory
greekLowercase GreekLowerCaseFilterFactory
greekStem GreekStemFilterFactory
hindiNormalization HindiNormalizationFilterFactory
hindiStem HindiStemFilterFactory
hungarianLightStem HungarianLightStemFilterFactory
hunspellStem HunspellStemFilterFactory
hyphenatedWords HyphenatedWordsFilterFactory
hyphenationCompoundWord HyphenationCompoundWordTokenFilterFactory
icuFolding ICUFoldingFilterFactory
icuNormalizer2 ICUNormalizer2FilterFactory
icuTransform ICUTransformFilterFactory
indicNormalization IndicNormalizationFilterFactory
indonesianStem IndonesianStemFilterFactory
irishLowercase IrishLowerCaseFilterFactory
italianLightStem ItalianLightStemFilterFactory
kStem KStemFilterFactory
keepWord KeepWordFilterFactory
keywordMarker KeywordMarkerFilterFactory
keywordRepeat KeywordRepeatFilterFactory
latvianStem LatvianStemFilterFactory
length LengthFilterFactory
limitTokenCount LimitTokenCountFilterFactory
limitTokenOffset LimitTokenOffsetFilterFactory
limitTokenPosition LimitTokenPositionFilterFactory
lowercase LowerCaseFilterFactory
minHash MinHashFilterFactory
nGram NGramFilterFactory
norwegianLightStem NorwegianLightStemFilterFactory
norwegianMinimalStem NorwegianMinimalStemFilterFactory
numericPayload NumericPayloadTokenFilterFactory
patternCaptureGroup PatternCaptureGroupFilterFactory
patternReplace PatternReplaceFilterFactory
persianNormalization PersianNormalizationFilterFactory
porterStem PorterStemFilterFactory
portugueseLightStem PortugueseLightStemFilterFactory
portugueseMinimalStem PortugueseMinimalStemFilterFactory
portugueseStem PortugueseStemFilterFactory
protectedTerm ProtectedTermFilterFactory
removeDuplicates RemoveDuplicatesTokenFilterFactory
reverseString ReverseStringFilterFactory
russianLightStem RussianLightStemFilterFactory
scandinavianFolding ScandinavianFoldingFilterFactory
scandinavianNormalization ScandinavianNormalizationFilterFactory
serbianNormalization SerbianNormalizationFilterFactory
shingle ShingleFilterFactory
snowballPorter SnowballPorterFilterFactory
soraniNormalization SoraniNormalizationFilterFactory
soraniStem SoraniStemFilterFactory
spanishLightStem SpanishLightStemFilterFactory
spanishMinimalStem SpanishMinimalStemFilterFactory
stemmerOverride StemmerOverrideFilterFactory
stop StopFilterFactory
swedishLightStem SwedishLightStemFilterFactory
synonym SynonymFilterFactory
synonymGraph SynonymGraphFilterFactory
tokenOffsetPayload TokenOffsetPayloadTokenFilterFactory
trim TrimFilterFactory
truncate TruncateTokenFilterFactory
turkishLowercase TurkishLowerCaseFilterFactory
type TypeTokenFilterFactory
typeAsPayload TypeAsPayloadTokenFilterFactory
typeAsSynonym TypeAsSynonymFilterFactory
uppercase UpperCaseFilterFactory
wordDelimiter WordDelimiterFilterFactory
wordDelimiterGraph WordDelimiterGraphFilterFactory

Helper Code to Extract All Names

The raw content of each table (HTML rows linking each class name to its JavaDoc page) was generated by the following code:

Java
import org.apache.lucene.analysis.util.TokenizerFactory;  
import org.apache.lucene.analysis.util.TokenFilterFactory;  
import org.apache.lucene.analysis.util.CharFilterFactory;  
import java.util.Map;  
import java.util.TreeMap;  

public class Main {  

    public static void main(String[] args) {  
        Main main = new Main();  
        main.factoryNamesLister();  
    }  

    // Lists every registered SPI name with its factory class, sorted by name.
    private void factoryNamesLister() {
        Map<String, String> map = new TreeMap<>();
        TokenizerFactory.availableTokenizers().forEach((item) -> {  
            map.put(item, TokenizerFactory.lookupClass(item).getCanonicalName());  
        });  
        printTableRows(map);  

        map.clear();  
        CharFilterFactory.availableCharFilters().forEach((item) -> {  
            map.put(item, CharFilterFactory.lookupClass(item).getCanonicalName());  
        });  
        printTableRows(map);  

        map.clear();  
        TokenFilterFactory.availableTokenFilters().forEach((item) -> {  
            map.put(item, TokenFilterFactory.lookupClass(item).getCanonicalName());  
        });  
        printTableRows(map);  
    }  

    private void printTableRows(Map<String, String> map) {  
        StringBuilder sb = new StringBuilder();  
        map.entrySet().forEach((entry) -> {  
            String item = entry.getKey();  
            String fqClassName = entry.getValue();  
            String className = fqClassName.substring(fqClassName.lastIndexOf('.') + 1);
            String url = classNameToUrl(fqClassName);  
            sb.append("<tr><td>").append(item).append("</td><td>")  
                    .append("<a href=\"").append(url).append("\">")  
                    .append(className).append("</a></td></tr>\n");  
        });  
        System.out.println(sb.toString());  
        System.out.println();  
    }  

    private String classNameToUrl(String className) {  
        final String baseUrl = "https://lucene.apache.org/core/8_3_0/";  
        // assume all relevant fully-qualified class names start with this:  
        final String pathStart = "org.apache.lucene.analysis";  
        if (className == null || className.isBlank() || !className.startsWith(pathStart)) {  
            return "";  
        }  

        String[] classParts = className.split("\\.");  
        if (classParts.length < 6) {  
            return "";  
        }  

        // For example, converts this:  
        // "org.apache.lucene.analysis.core.KeywordTokenizerFactory"  
        // to this:  
        // "org/apache/lucene/analysis/core/KeywordTokenizerFactory.html"  
        String classAsPath = className.replace('.', '/') + ".html";

        StringBuilder sb = new StringBuilder();  
        sb.append(baseUrl);  

        switch (classParts[4]) {  
            case "icu":  
                sb.append("analyzers-icu/").append(classAsPath);  
                break;  
            default:  
                sb.append("analyzers-common/").append(classAsPath);  
        }  
        return sb.toString();  
    }  
}