Build Lucene Synonyms
Question
I have the following code:
static class TaggerAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("al"), new CharsRef("americanleague"), true);
        builder.add(new CharsRef("al"), new CharsRef("a.l."), true);
        builder.add(new CharsRef("nba"), new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), true);
        SynonymMap mySynonymMap = null;
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            e.printStackTrace();
        }
        Tokenizer source = new ClassicTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new SynonymFilter(filter, mySynonymMap, true);
        return new TokenStreamComponents(source, filter);
    }
}
I have been running some tests, and everything worked until I ran into this scenario:
String title = "Very short title at a.l. bla bla";
Assert.assertTrue(TagUtil.evaluate(memoryIndex,"americanleague"));
Assert.assertTrue(TagUtil.evaluate(memoryIndex,"al"));
I expected both assertions to pass, but "americanleague" did not match "a.l.", even though both "a.l." and "americanleague" are synonyms of "al".
So, what should I do? I don't want to add every combination to the map. Thanks.
Recommended answer
I believe you have your arguments to builder.add backwards. Try:
builder.add(new CharsRef("americanleague"), new CharsRef("al"), true);
builder.add(new CharsRef("a.l."), new CharsRef("al"), true);
builder.add(new CharsRef("national" + SynonymMap.WORD_SEPARATOR + "basketball" + SynonymMap.WORD_SEPARATOR + "association"), new CharsRef("nba"), true);
The SynonymFilter only maps from the first argument (input) to the second argument (output), not the other way around. So you have rules that translate "al" into two different synonyms, but none that do anything with an input of "a.l." or "americanleague".
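To make the direction of the mapping concrete, here is a small toy model in plain Java. It is not Lucene code and does not reproduce how SynonymFilter works internally; it only imitates the argument order of builder.add(input, output, includeOrig) with includeOrig assumed to be true, and it assumes the same rules are applied to both the indexed text and the query. The class SynonymDirectionDemo and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of one-directional synonym rules. NOT Lucene's implementation.
class SynonymDirectionDemo {
    // input token -> output tokens registered for it
    private final Map<String, List<String>> rules = new HashMap<>();

    // Same argument order as builder.add(input, output, true): input first, output second.
    void add(String input, String output) {
        rules.computeIfAbsent(input, k -> new ArrayList<>()).add(output);
    }

    // The token itself (includeOrig assumed true) plus every output registered for it.
    Set<String> expand(String token) {
        Set<String> out = new HashSet<>();
        out.add(token);
        out.addAll(rules.getOrDefault(token, Collections.<String>emptyList()));
        return out;
    }

    // True if the indexed token and the query token share a term after expansion.
    boolean matches(String indexedToken, String queryToken) {
        Set<String> shared = expand(indexedToken);
        shared.retainAll(expand(queryToken));
        return !shared.isEmpty();
    }

    // Rules as written in the question: "al" is the input.
    static SynonymDirectionDemo questionRules() {
        SynonymDirectionDemo demo = new SynonymDirectionDemo();
        demo.add("al", "americanleague");
        demo.add("al", "a.l.");
        return demo;
    }

    // Rules as suggested in the answer: "al" is the output.
    static SynonymDirectionDemo answerRules() {
        SynonymDirectionDemo demo = new SynonymDirectionDemo();
        demo.add("americanleague", "al");
        demo.add("a.l.", "al");
        return demo;
    }

    public static void main(String[] args) {
        // No rule has "a.l." or "americanleague" as input, so they never meet.
        System.out.println(questionRules().matches("a.l.", "americanleague"));
        // Both now expand to include "al", so they meet on that shared term.
        System.out.println(answerRules().matches("a.l.", "americanleague"));
    }
}
```

In this model, the question's rules leave "a.l." and "americanleague" untouched because neither appears as an input, while the answer's rules funnel both into the shared term "al". That is why there is no need to list every pairwise combination: mapping each variant to one common term is enough.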