SPI Names in Lucene 8.3.0

The SPI (service provider interface) names listed below can be used to build custom analyzers. For example:

Analyzer analyzer = CustomAnalyzer.builder()  
        .withTokenizer("icu")  
        .addTokenFilter("lowercase")  
        .addTokenFilter("asciiFolding")  
        .build();  

This can be much more succinct than using the related classes directly.

The Custom Analyzer

Here is a brief look at the CustomAnalyzer class used in the previous code example.

Some examples of using this class are provided in the JavaDoc.

For example, if you want to add a list of stop words to your analyzer, you can do so in the following ways:

.addTokenFilter(StopFilterFactory.NAME,
                "ignoreCase", "false",
                "words", "stopwords.txt",
                "format", "wordset")

Or, by using the StopFilter’s SPI name (which is stop):

.addTokenFilter("stop",
                "ignoreCase", "false",
                "words", "stopwords.txt",
                "format", "wordset")

Or by passing in the key/value parameters in a Map:

Map<String, String> stopMap = new HashMap<>();
stopMap.put("words", "stopwords.txt");
stopMap.put("format", "wordset");

And then:

.addTokenFilter("stop", stopMap)
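
The varargs form and the Map form carry the same information: the varargs list is simply a run of alternating keys and values. As a minimal illustration in plain Java (no Lucene calls; the class name ParamPairs and the method toMap are hypothetical and not part of Lucene), the conversion could be sketched like this:

```java
import java.util.HashMap;
import java.util.Map;

public class ParamPairs {

    // Hypothetical helper: turns alternating key/value strings into a mutable map.
    public static Map<String, String> toMap(String... keysAndValues) {
        if (keysAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of key/value strings");
        }
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            params.put(keysAndValues[i], keysAndValues[i + 1]);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> stopMap = toMap(
                "ignoreCase", "false",
                "words", "stopwords.txt",
                "format", "wordset");
        System.out.println(stopMap.get("words")); // prints "stopwords.txt"
    }
}
```

A HashMap is used here because the CustomAnalyzer JavaDoc asks for a modifiable map, so an immutable map such as Map.of(...) is best avoided for this purpose.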

Those key/value parameters (e.g. ignoreCase, words and format) are defined by the related filter factory. For the stop filter, for example, they are documented in the JavaDoc for StopFilterFactory.

Note that the no-argument CustomAnalyzer.builder() shown earlier loads resources from the classpath, so the stop words file (e.g. stopwords.txt) is expected to be on the classpath - for example, in the default package of your application.
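
If the analyzer fails to build because the file cannot be found, a quick check of classpath visibility can help. The following is a small sketch using only the standard class loader (the class name ClasspathCheck and the method isOnClasspath are hypothetical); it assumes the file sits at the classpath root, as described above:

```java
import java.net.URL;

public class ClasspathCheck {

    // Returns true if the named resource can be found from the classpath root.
    public static boolean isOnClasspath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        URL location = loader.getResource(resourceName);
        return location != null;
    }

    public static void main(String[] args) {
        System.out.println("stopwords.txt visible: " + isOnClasspath("stopwords.txt"));
    }
}
```

This only reports whether the class loader can see the file; it does not check the file's contents or format.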

Tokenizers

SPI Name            Class Name
classic             ClassicTokenizerFactory
edgeNGram           EdgeNGramTokenizerFactory
icu                 ICUTokenizerFactory
keyword             KeywordTokenizerFactory
letter              LetterTokenizerFactory
nGram               NGramTokenizerFactory
pathHierarchy       PathHierarchyTokenizerFactory
pattern             PatternTokenizerFactory
simplePattern       SimplePatternTokenizerFactory
simplePatternSplit  SimplePatternSplitTokenizerFactory
standard            StandardTokenizerFactory
thai                ThaiTokenizerFactory
uax29UrlEmail       UAX29URLEmailTokenizerFactory
whitespace          WhitespaceTokenizerFactory
wikipedia           WikipediaTokenizerFactory

Char Filters

SPI Name        Class Name
htmlStrip       HTMLStripCharFilterFactory
icuNormalizer2  ICUNormalizer2CharFilterFactory
mapping         MappingCharFilterFactory
patternReplace  PatternReplaceCharFilterFactory
persian         PersianCharFilterFactory

Token Filters

SPI Name                   Class Name
apostrophe                 ApostropheFilterFactory
arabicNormalization        ArabicNormalizationFilterFactory
arabicStem                 ArabicStemFilterFactory
asciiFolding               ASCIIFoldingFilterFactory
bengaliNormalization       BengaliNormalizationFilterFactory
bengaliStem                BengaliStemFilterFactory
brazilianStem              BrazilianStemFilterFactory
bulgarianStem              BulgarianStemFilterFactory
capitalization             CapitalizationFilterFactory
cjkBigram                  CJKBigramFilterFactory
cjkWidth                   CJKWidthFilterFactory
classic                    ClassicFilterFactory
codepointCount             CodepointCountFilterFactory
commonGrams                CommonGramsFilterFactory
commonGramsQuery           CommonGramsQueryFilterFactory
concatenateGraph           ConcatenateGraphFilterFactory
czechStem                  CzechStemFilterFactory
dateRecognizer             DateRecognizerFilterFactory
decimalDigit               DecimalDigitFilterFactory
delimitedPayload           DelimitedPayloadTokenFilterFactory
delimitedTermFrequency     DelimitedTermFrequencyTokenFilterFactory
dictionaryCompoundWord     DictionaryCompoundWordTokenFilterFactory
edgeNGram                  EdgeNGramFilterFactory
elision                    ElisionFilterFactory
englishMinimalStem         EnglishMinimalStemFilterFactory
englishPossessive          EnglishPossessiveFilterFactory
fingerprint                FingerprintFilterFactory
finnishLightStem           FinnishLightStemFilterFactory
fixBrokenOffsets           FixBrokenOffsetsFilterFactory
fixedShingle               FixedShingleFilterFactory
flattenGraph               FlattenGraphFilterFactory
frenchLightStem            FrenchLightStemFilterFactory
frenchMinimalStem          FrenchMinimalStemFilterFactory
galicianMinimalStem        GalicianMinimalStemFilterFactory
galicianStem               GalicianStemFilterFactory
germanLightStem            GermanLightStemFilterFactory
germanMinimalStem          GermanMinimalStemFilterFactory
germanNormalization        GermanNormalizationFilterFactory
germanStem                 GermanStemFilterFactory
greekLowercase             GreekLowerCaseFilterFactory
greekStem                  GreekStemFilterFactory
hindiNormalization         HindiNormalizationFilterFactory
hindiStem                  HindiStemFilterFactory
hungarianLightStem         HungarianLightStemFilterFactory
hunspellStem               HunspellStemFilterFactory
hyphenatedWords            HyphenatedWordsFilterFactory
hyphenationCompoundWord    HyphenationCompoundWordTokenFilterFactory
icuFolding                 ICUFoldingFilterFactory
icuNormalizer2             ICUNormalizer2FilterFactory
icuTransform               ICUTransformFilterFactory
indicNormalization         IndicNormalizationFilterFactory
indonesianStem             IndonesianStemFilterFactory
irishLowercase             IrishLowerCaseFilterFactory
italianLightStem           ItalianLightStemFilterFactory
kStem                      KStemFilterFactory
keepWord                   KeepWordFilterFactory
keywordMarker              KeywordMarkerFilterFactory
keywordRepeat              KeywordRepeatFilterFactory
latvianStem                LatvianStemFilterFactory
length                     LengthFilterFactory
limitTokenCount            LimitTokenCountFilterFactory
limitTokenOffset           LimitTokenOffsetFilterFactory
limitTokenPosition         LimitTokenPositionFilterFactory
lowercase                  LowerCaseFilterFactory
minHash                    MinHashFilterFactory
nGram                      NGramFilterFactory
norwegianLightStem         NorwegianLightStemFilterFactory
norwegianMinimalStem       NorwegianMinimalStemFilterFactory
numericPayload             NumericPayloadTokenFilterFactory
patternCaptureGroup        PatternCaptureGroupFilterFactory
patternReplace             PatternReplaceFilterFactory
persianNormalization       PersianNormalizationFilterFactory
porterStem                 PorterStemFilterFactory
portugueseLightStem        PortugueseLightStemFilterFactory
portugueseMinimalStem      PortugueseMinimalStemFilterFactory
portugueseStem             PortugueseStemFilterFactory
protectedTerm              ProtectedTermFilterFactory
removeDuplicates           RemoveDuplicatesTokenFilterFactory
reverseString              ReverseStringFilterFactory
russianLightStem           RussianLightStemFilterFactory
scandinavianFolding        ScandinavianFoldingFilterFactory
scandinavianNormalization  ScandinavianNormalizationFilterFactory
serbianNormalization       SerbianNormalizationFilterFactory
shingle                    ShingleFilterFactory
snowballPorter             SnowballPorterFilterFactory
soraniNormalization        SoraniNormalizationFilterFactory
soraniStem                 SoraniStemFilterFactory
spanishLightStem           SpanishLightStemFilterFactory
spanishMinimalStem         SpanishMinimalStemFilterFactory
stemmerOverride            StemmerOverrideFilterFactory
stop                       StopFilterFactory
swedishLightStem           SwedishLightStemFilterFactory
synonym                    SynonymFilterFactory
synonymGraph               SynonymGraphFilterFactory
tokenOffsetPayload         TokenOffsetPayloadTokenFilterFactory
trim                       TrimFilterFactory
truncate                   TruncateTokenFilterFactory
turkishLowercase           TurkishLowerCaseFilterFactory
type                       TypeTokenFilterFactory
typeAsPayload              TypeAsPayloadTokenFilterFactory
typeAsSynonym              TypeAsSynonymFilterFactory
uppercase                  UpperCaseFilterFactory
wordDelimiter              WordDelimiterFilterFactory
wordDelimiterGraph         WordDelimiterGraphFilterFactory

Helper Code to Extract All Names

The raw content of each table was generated by the following code:

import org.apache.lucene.analysis.util.TokenizerFactory;  
import org.apache.lucene.analysis.util.TokenFilterFactory;  
import org.apache.lucene.analysis.util.CharFilterFactory;  
import java.util.Map;  
import java.util.TreeMap;  

public class Main {  

    public static void main(String[] args) {  
        Main main = new Main();  
        main.factoryNamesLister();  
    }  

    private void factoryNamesLister() {  
        // Sorted map of SPI name -> fully-qualified factory class name.
        Map<String, String> map = new TreeMap<>();
        TokenizerFactory.availableTokenizers().forEach((item) -> {  
            map.put(item, TokenizerFactory.lookupClass(item).getCanonicalName());  
        });  
        printTableRows(map);  

        map.clear();  
        CharFilterFactory.availableCharFilters().forEach((item) -> {  
            map.put(item, CharFilterFactory.lookupClass(item).getCanonicalName());  
        });  
        printTableRows(map);  

        map.clear();  
        TokenFilterFactory.availableTokenFilters().forEach((item) -> {  
            map.put(item, TokenFilterFactory.lookupClass(item).getCanonicalName());  
        });  
        printTableRows(map);  
    }  

    private void printTableRows(Map<String, String> map) {  
        StringBuilder sb = new StringBuilder();  
        map.entrySet().forEach((entry) -> {  
            String item = entry.getKey();  
            String fqClassName = entry.getValue();  
            // Simple class name: everything after the last dot.
            String className = fqClassName.substring(fqClassName.lastIndexOf('.') + 1);
            String url = classNameToUrl(fqClassName);  
            sb.append("<tr><td>").append(item).append("</td><td>")  
                    .append("<a href=\"").append(url).append("\">")  
                    .append(className).append("</a></td></tr>\n");  
        });  
        System.out.println(sb.toString());  
        System.out.println();  
    }  

    private String classNameToUrl(String className) {  
        final String baseUrl = "https://lucene.apache.org/core/8_3_0/";  
        // assume all relevant fully-qualified class names start with this:  
        final String pathStart = "org.apache.lucene.analysis";  
        if (className == null || className.isBlank() || !className.startsWith(pathStart)) {  
            return "";  
        }  

        String[] classParts = className.split("\\.");  
        if (classParts.length < 6) {  
            return "";  
        }  

        // For example, converts this:  
        // "org.apache.lucene.analysis.core.KeywordTokenizerFactory"  
        // to this:  
        // "org/apache/lucene/analysis/core/KeywordTokenizerFactory.html"  
        String classAsPath = className.replace('.', '/') + ".html";

        StringBuilder sb = new StringBuilder();  
        sb.append(baseUrl);  

        switch (classParts[4]) {  
            case "icu":  
                sb.append("analyzers-icu/").append(classAsPath);  
                break;  
            default:  
                sb.append("analyzers-common/").append(classAsPath);  
        }  
        return sb.toString();  
    }  
}
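
To see what one generated row looks like without running against Lucene, here is a condensed, dependency-free restatement of the URL and row logic above. It is only a sketch: the class name RowPreview and the methods toUrl and toRow are hypothetical, and the icu check is written as a package-prefix test rather than a switch.

```java
public class RowPreview {

    private static final String BASE_URL = "https://lucene.apache.org/core/8_3_0/";
    private static final String PATH_START = "org.apache.lucene.analysis";

    // Maps a fully-qualified factory class name to its 8.3.0 JavaDoc URL,
    // or returns an empty string if the name does not look like an analysis factory.
    public static String toUrl(String fqClassName) {
        if (fqClassName == null || !fqClassName.startsWith(PATH_START)
                || fqClassName.split("\\.").length < 6) {
            return "";
        }
        // ICU factories are documented under the analyzers-icu module.
        String module = fqClassName.startsWith(PATH_START + ".icu.")
                ? "analyzers-icu/" : "analyzers-common/";
        return BASE_URL + module + fqClassName.replace('.', '/') + ".html";
    }

    // Builds one HTML table row: the SPI name, then the linked simple class name.
    public static String toRow(String spiName, String fqClassName) {
        String simpleName = fqClassName.substring(fqClassName.lastIndexOf('.') + 1);
        return "<tr><td>" + spiName + "</td><td><a href=\"" + toUrl(fqClassName)
                + "\">" + simpleName + "</a></td></tr>";
    }

    public static void main(String[] args) {
        System.out.println(toRow("keyword",
                "org.apache.lucene.analysis.core.KeywordTokenizerFactory"));
    }
}
```

For the keyword tokenizer this yields a row whose link points into analyzers-common, while any class under org.apache.lucene.analysis.icu is routed to analyzers-icu instead.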