What Solr tokenizer and filters can I use for a strong general site search?


Problem description



I'd like to ensure that, say, I.B.M. can be found by searching for ibm. I'd also like to make sure that Dismemberment Plan can be found by searching for dismember.

Using Solr, what tokenizer and filters can I use in analysis and query time to permit both kinds of results?

Solution

For I.B.M. => ibm
You need a solr.WordDelimiterFilterFactory, which splits tokens on punctuation and other delimiters and can catenate the resulting word and number parts.

catenateWords="1" joins the word parts again, transforming I.B.M. into IBM.

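Dismemberment => dismember
Include a stemming filter (e.g. solr.PorterStemFilterFactory or solr.EnglishMinimalStemFilterFactory), which indexes the root form of each word so that words sharing a root match. Stemmers differ in how aggressively they strip suffixes, so check that both terms actually reduce to the same stem with the stemmer you choose.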

In addition, you can use solr.LowerCaseFilterFactory for case-insensitive matches (IBM and ibm) and solr.ASCIIFoldingFilterFactory to handle accented and other non-ASCII characters.
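The example field type further down does not include ASCII folding. As a minimal sketch of where it could sit in a chain, assuming a hypothetical field type name (text_en_folded) and assuming folding should run after lowercasing and before stemming:

```xml
<!-- Hypothetical field type name; the filter order is an assumption, not part of the original answer -->
<fieldType name="text_en_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used at both index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds accented characters to their ASCII equivalents, e.g. "é" to "e" -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Folding before the stemmer means the stemmer only sees plain ASCII input, which is what an English stemmer is designed for.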

You can also use solr.SynonymFilterFactory to map words that you consider synonyms.

You can apply these filters at both index and query time, so that terms are transformed the same way on both sides and the results are consistent.
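As a rough sketch of the synonym filter mentioned above: the file name synonyms.txt and its sample entry are assumptions for illustration, and the line would go inside an analyzer element of a field type.

```xml
<!-- synonyms.txt is an assumed file name, expected in the core's conf directory.
     A sample line in that file could be:  tv, television
     (comma-separated terms are treated as equivalent) -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
```

With expand="true", each term in a comma-separated group is expanded to all terms in that group, so a search for either one can match documents containing the other.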

An example field type definition:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- Index and Query time -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemmer -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
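For the field type to take effect, a field in the schema has to reference it. A minimal sketch, assuming a hypothetical field named content:

```xml
<!-- "content" is a hypothetical field name; type must match the fieldType name defined above -->
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
```

After changing index-time analysis, existing documents need to be reindexed before the new tokens become searchable.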

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
