How to tokenize a word formed by combining two words without whitespace
Question
I have a word like lovelive, which is formed by combining two simple words, love and live, without whitespace.
I want to know which kind of Lucene Analyzer can tokenize this kind of word into two separate words?
Answer
Have a look at the DictionaryCompoundWordTokenFilter, as described in the Solr reference:
This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
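The decompounding behaviour shown above can be sketched in Python. This is a simplified illustration of the dictionary lookup, not Lucene's actual implementation, which also applies defaults such as a minimum word size of 5 and subword sizes between 2 and 15:

```python
def decompound(token, dictionary, min_sub=2, max_sub=15):
    """Emit the original token unchanged, then every dictionary
    subword found inside it, in order of position."""
    out = [token]
    low = token.lower()
    for i in range(len(low)):
        for j in range(min_sub, min(max_sub, len(low) - i) + 1):
            if low[i:i + j] in dictionary:
                out.append(token[i:i + j])
    return out

# Reproduces the Solr example above:
german = {"donau", "dampf", "schiff", "dumm", "kopf"}
for tok in ["Donaudampfschiff", "dummkopf"]:
    print(tok, "->", decompound(tok, german))

# And the question's case:
print(decompound("lovelive", {"love", "live"}))
```

Each subword is emitted alongside the original token, which matches the filter's behaviour of adding subwords at the same logical position rather than replacing the compound.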
As you can see in the sample configuration, you will need a dictionary for the language you want to split. In the sample they use a germanwords.txt that contains the words they want to decompose when found combined. In your case this would be love and live.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
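The dictionary referenced by the filter is a plain-text word list, one word per line. For the question's case, a hypothetical words.txt (swapped in for germanwords.txt in the configuration above) would contain:

```
love
live
```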
For Lucene it is org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter. The code can be found on GitHub.