How to use n-gram approximate matching with Solr?


Question

We have a database of movies and series. Because the data comes from many sources of varying reliability, we'd like to be able to do fuzzy string matching on episode titles. We use Solr for search in our application, but its default matching mechanisms operate at the word level, which is not good enough for short strings such as titles.

I have used n-gram approximate matching in the past, and I was very happy to find that Lucene (and Solr) supports something like this out of the box. Unfortunately, I haven't been able to configure it correctly.

I assumed that I needed a special field type for this, so I added the following field type to my schema.xml:

<fieldType 
   name="trigrams" 
   stored="true" 
   class="solr.StrField"> 
 <analyzer type="index"> 
   <tokenizer 
       class="solr.analysis.NGramTokenizerFactory" 
       minGramSize="3" 
       maxGramSize="5" 
       /> 
   <filter class="solr.LowerCaseFilterFactory"/> 
 </analyzer> 
</fieldType> 

and changed the appropriate field in the schema to:

<field name="title" type="trigrams" 
    indexed="true" stored="true" multiValued="false" /> 

However, this is not working as I expected. The query analysis looks correct, but I don't get any results, which makes me believe that something goes wrong at index time (i.e. the title is indexed like a default string field instead of a trigram field).

The query I am trying is something like

title:"guy walks into a psychiatrist office"

(with a typo or two) and it should match "Guy Walks into a Psychiatrist Office".

(I am not really sure whether the query is correct.)

Moreover, I would actually like to do something more. I'd like to lowercase the string, remove all punctuation marks and spaces, remove English stopwords, and only THEN turn the string into trigrams. However, filters are applied only after the string has been tokenized...

Thanks in advance for your answers.

Recommended answer

To answer the last part of your question: Solr also has an n-gram filter. So you should not use the n-gram tokenizer; use a different one (for example "WhitespaceTokenizer"), apply all the filters that must run before n-gramming, and then add this one:

<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3" />
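Put together, the chain described above might look like the following sketch. The field type name "title_ngrams", the stopwords.txt file, and the punctuation-stripping pattern are assumptions for illustration and are not part of the original answer. Note that the sketch uses solr.TextField: solr.StrField, as used in the question, is not analyzed, so any analyzer declared on it is ignored. That may well be why the title appeared to be indexed as a plain string.

```xml
<!-- Sketch only: "title_ngrams" and stopwords.txt are assumed names -->
<fieldType name="title_ngrams" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used at both index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strip everything except lowercase letters and digits from each token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^a-z0-9]" replacement="" replace="all"/>
    <!-- Drop English stopwords listed in the (assumed) stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Finally, split each remaining token into 2- and 3-character grams -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"/>
  </analyzer>
</fieldType>
```

With a whitespace tokenizer the grams are produced per word rather than across the whole title, so this is not exactly the "remove all spaces first" behaviour asked for in the question. After changing the field type, the existing documents need to be reindexed before the new analysis takes effect.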
