SPI Names in Lucene 8.3.0
The SPI (service provider interface) names listed below can be used to build custom analyzers. For example:
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("asciiFolding")
            .build();
This can be much more succinct than using the related classes directly.
The Custom Analyzer
Here is a brief look at the CustomAnalyzer class used in the previous code example.
Some examples of using this class are provided in the JavaDoc.
For example, if you want to add a list of stop words to your analyzer, you can do so in the following ways:
    .addTokenFilter(StopFilterFactory.NAME,
            "ignoreCase", "false",
            "words", "stopwords.txt",
            "format", "wordset")
Or by using the stop filter's SPI name (which is stop):
    .addTokenFilter("stop",
            "ignoreCase", "false",
            "words", "stopwords.txt",
            "format", "wordset")
Or by passing in the key/value parameters in a Map:
    Map<String, String> stopMap = new HashMap<>();
    stopMap.put("words", "stopwords.txt");
    stopMap.put("format", "wordset");
And then:
    .addTokenFilter("stop", stopMap)
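The Map-based form is convenient when the parameters are assembled at runtime. One caveat, offered as an assumption to verify against your Lucene version: the CustomAnalyzer.Builder JavaDoc appears to require a modifiable map, because the factory consumes entries as it reads them. The sketch below uses a hypothetical helper, toParams, which is not part of Lucene, to copy key/value pairs into a fresh HashMap for each call:

```java
import java.util.HashMap;
import java.util.Map;

final class FilterParams {

    // Hypothetical helper (not part of Lucene): copies alternating
    // key/value strings into a new, modifiable HashMap.
    static Map<String, String> toParams(String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected key/value pairs");
        }
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            params.put(keyValues[i], keyValues[i + 1]);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> stopMap =
                toParams("words", "stopwords.txt", "format", "wordset");
        System.out.println(stopMap.get("words"));
        System.out.println(stopMap.get("format"));
    }
}
```

The result could then be passed in as .addTokenFilter("stop", FilterParams.toParams("words", "stopwords.txt", "format", "wordset")). Building a new map per call avoids reusing one that an earlier factory may already have modified.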
Those key/value parameters (e.g. ignoreCase, words and format) are defined in the related filter factory. For example, the parameters for StopFilterFactory are documented in that class's JavaDoc.
Don’t forget, in the case of stop words, the file (e.g. stopwords.txt) is expected to be on the classpath, for example in the default package of your application.
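If the analyzer fails to build because the stop word file cannot be found, a quick check like the one below can confirm whether the file is visible from the classpath root. This is a minimal sketch using only the JDK. The class name ClasspathCheck is made up for illustration, and it assumes the default CustomAnalyzer.builder() resolves resources relative to the classpath root, so the name has no leading slash:

```java
import java.net.URL;

final class ClasspathCheck {

    // Returns true if the named resource is visible from the classpath root
    // of the class loader that loaded this class.
    static boolean isOnClasspath(String resourceName) {
        URL url = ClasspathCheck.class.getClassLoader().getResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        // Defaults to the file name used in the examples above.
        String name = args.length > 0 ? args[0] : "stopwords.txt";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

In a typical Maven or Gradle project, placing stopwords.txt directly in src/main/resources puts it at the classpath root.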
Tokenizers
Char Filters
Token Filters
The raw content of each table was generated by the following code:
    import org.apache.lucene.analysis.util.CharFilterFactory;
    import org.apache.lucene.analysis.util.TokenFilterFactory;
    import org.apache.lucene.analysis.util.TokenizerFactory;

    import java.util.Map;
    import java.util.TreeMap;

    public class Main {

        public static void main(String[] args) {
            Main main = new Main();
            main.factoryNamesLister();
        }

        // Prints one block of HTML table rows for each factory type.
        private void factoryNamesLister() {
            // A TreeMap keeps the SPI names sorted alphabetically.
            Map<String, String> map = new TreeMap<>();

            TokenizerFactory.availableTokenizers().forEach((item) -> {
                map.put(item, TokenizerFactory.lookupClass(item).getCanonicalName());
            });
            printTableRows(map);
            map.clear();

            CharFilterFactory.availableCharFilters().forEach((item) -> {
                map.put(item, CharFilterFactory.lookupClass(item).getCanonicalName());
            });
            printTableRows(map);
            map.clear();

            TokenFilterFactory.availableTokenFilters().forEach((item) -> {
                map.put(item, TokenFilterFactory.lookupClass(item).getCanonicalName());
            });
            printTableRows(map);
        }

        // Writes one <tr> per SPI name, linking the factory class to its JavaDoc page.
        private void printTableRows(Map<String, String> map) {
            StringBuilder sb = new StringBuilder();
            map.forEach((item, fqClassName) -> {
                // Simple class name: everything after the last dot.
                String className = fqClassName.substring(fqClassName.lastIndexOf('.') + 1);
                String url = classNameToUrl(fqClassName);
                sb.append("<tr><td>").append(item).append("</td><td>")
                        .append("<a href=\"").append(url).append("\">")
                        .append(className).append("</a></td></tr>\n");
            });
            System.out.println(sb.toString());
            System.out.println();
        }

        private String classNameToUrl(String className) {
            final String baseUrl = "https://lucene.apache.org/core/8_3_0/";
            // assume all relevant fully-qualified class names start with this:
            final String pathStart = "org.apache.lucene.analysis";
            // Note: String.isBlank() requires Java 11 or later.
            if (className == null || className.isBlank() || !className.startsWith(pathStart)) {
                return "";
            }
            String[] classParts = className.split("\\.");
            if (classParts.length < 6) {
                return "";
            }
            // For example, converts this:
            // "org.apache.lucene.analysis.core.KeywordTokenizerFactory"
            // to this:
            // "org/apache/lucene/analysis/core/KeywordTokenizerFactory.html"
            String classAsPath = className.replace('.', '/') + ".html";
            StringBuilder sb = new StringBuilder();
            sb.append(baseUrl);
            // classParts[4] is the sub-package directly under org.apache.lucene.analysis.
            switch (classParts[4]) {
                case "icu":
                    sb.append("analyzers-icu/").append(classAsPath);
                    break;
                default:
                    sb.append("analyzers-common/").append(classAsPath);
            }
            return sb.toString();
        }
    }
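To make the URL mapping easier to trace by hand, here is a trimmed-down restatement of the same logic that runs without Lucene on the classpath. The class and method names (UrlSketch, toUrl, row) are illustrative only, and the input validation from the full listing is omitted for brevity:

```java
final class UrlSketch {

    static final String BASE_URL = "https://lucene.apache.org/core/8_3_0/";

    // Maps a fully-qualified factory class name to its JavaDoc URL.
    // Assumes the name starts with "org.apache.lucene.analysis." and has
    // at least one sub-package, as in the full listing.
    static String toUrl(String fqClassName) {
        String[] parts = fqClassName.split("\\.");
        // parts[4] is the sub-package under org.apache.lucene.analysis.
        String module = "icu".equals(parts[4]) ? "analyzers-icu/" : "analyzers-common/";
        return BASE_URL + module + fqClassName.replace('.', '/') + ".html";
    }

    // Builds a single HTML table row for an SPI name and its factory class.
    static String row(String spiName, String fqClassName) {
        String simpleName = fqClassName.substring(fqClassName.lastIndexOf('.') + 1);
        return "<tr><td>" + spiName + "</td><td><a href=\"" + toUrl(fqClassName)
                + "\">" + simpleName + "</a></td></tr>";
    }

    public static void main(String[] args) {
        System.out.println(row("keyword",
                "org.apache.lucene.analysis.core.KeywordTokenizerFactory"));
        System.out.println(row("icu",
                "org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"));
    }
}
```

Any sub-package other than icu falls through to analyzers-common, so factories from other analysis modules on the classpath would need their own case before their links resolve correctly.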