How to tokenize a word formed by combining two words without whitespace
Question
I have a word like lovelive, which is formed by combining two simple words, love and live, without whitespace.
I want to know which kind of Lucene Analyzer can tokenize this kind of word into two separate words?
Answer
Have a look at the DictionaryCompoundWordTokenFilter, as described in the Solr reference:
This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
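The decompounding behaviour shown above can be sketched in Python. This is a simplified illustration of the dictionary lookup, not Lucene's actual implementation, which also applies defaults such as a minimum word size of 5 and subword sizes between 2 and 15:

```python
def decompound(token, dictionary, min_sub=2, max_sub=15):
    """Emit the original token unchanged, then every dictionary
    subword found inside it, in order of position."""
    out = [token]
    low = token.lower()
    for i in range(len(low)):
        for j in range(min_sub, min(max_sub, len(low) - i) + 1):
            if low[i:i + j] in dictionary:
                out.append(token[i:i + j])
    return out

# Reproduces the Solr example above:
german = {"donau", "dampf", "schiff", "dumm", "kopf"}
for tok in ["Donaudampfschiff", "dummkopf"]:
    print(tok, "->", decompound(tok, german))

# And the question's case:
print(decompound("lovelive", {"love", "live"}))
```

Each subword is emitted alongside the original token, which matches the filter's behaviour of adding subwords at the same logical position rather than replacing the compound.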
As you can see in the sample configuration, you will need a dictionary for the language you want to split. In the sample they use a germanwords.txt that contains the words they want to decompose when found combined. In your case this would be love and live.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
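The dictionary referenced by the filter is a plain-text word list, one word per line. For the question's case, a hypothetical words.txt (swapped in for germanwords.txt in the configuration above) would contain:

```
love
live
```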
For Lucene it is org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter. The code can be found on GitHub.