DoubleMetaphoneFilterFactory in Solr

Problem description

My goal is to integrate Solr so that the results returned by my application are accurate and fast. I am searching the name field using Double Metaphone so that names that sound similar are also matched, and then using fuzzy search (which uses the Levenshtein distance algorithm) to fetch only the results above a certain similarity percentage. The problem is that once I put Double Metaphone on the name field type, I am unable to perform a fuzzy search over that field.

The example configuration from my schema.xml looks like this:

<field name="sdn_names" type="doublemetaphonetic" indexed="true" stored="true" termVectors="true"/>
<!-- Definition of the doublemetaphonetic field type. -->
<fieldtype name="doublemetaphonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

In the Solr UI, when I search for sdn_names:abdul~0.50, it returns 0 results; if I change the query string to sdn_names:abdul, I get 180 records in the result set. I searched for a solution and found that when Double Metaphone is used for indexing, the phonetic value differs from the original value, so the Levenshtein distance calculated between the two strings is very large and the query returns 0 results. Please point me to any links, recommended solutions, or reading on this problem, as I am new to Solr. Thanks in advance.

Recommended answer

Metaphone and wildcards (or fuzzy matching) are not compatible.

Firstly, Lucene does not analyze query terms that use wildcards, fuzzy matching, regex, etc. As such, you are trying to match plain text against metaphone codes. So, you have:

  • In the index: APTL
  • In the query: abdul~0.5

Which I think makes it more obvious why you don't get any matches. Even ignoring case, that's a Levenshtein distance of 3 (b→p, d→t, and the u dropped), which is considerable for a five-letter term.

Mixing metaphone with wildcards doesn't make a great deal of sense. A valid metaphone match should be an exact match. The metaphone algorithm reduces the term to a code representing its first four sounds (simplifying somewhat).

These are two different and separate methods of searching for looser, still-relevant results. They should be kept separate, so if you want to be able to search with both fuzzy matching and metaphone, the best idea would be to index the metaphone codes and the full text in two different fields, and then search on both of them. Something like:

<field name="sdn_names_phonetic" type="doublemetaphonetic" indexed="true" stored="false" termVectors="true"/>
<field name="sdn_names" type="text_standard" indexed="true" stored="true" termVectors="true"/>

<fieldType name="text_standard" class="solr.TextField"> 
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/> 
</fieldType> 
<fieldtype name="doublemetaphonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

(Note: I've changed your metaphone field to stored=false; since both of these fields would store the same data, there is no need to store both of them.)
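
One thing the configuration above leaves implicit is how sdn_names_phonetic gets populated. A minimal sketch, assuming you want Solr to fill it from sdn_names at index time rather than sending the value twice from your application, is a copyField directive in schema.xml:

```xml
<!-- Assumption: the phonetic field is populated from sdn_names at index time. -->
<!-- copyField copies the raw input value before analysis, so each field applies its own analyzer. -->
<copyField source="sdn_names" dest="sdn_names_phonetic"/>
```

Without this, or an equivalent step in the indexing client, the phonetic field would stay empty and a clause against it would never match.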

Which could then be searched like this:

sdn_names:abdul~0.5 sdn_names_phonetic:abdul

See the Solr documentation section "Indexing same data in multiple fields" for a bit more about this sort of pattern.
