Solr: exact phrase query with an EdgeNGramFilterFactory

Question

In Solr (3.3), is it possible to make a field searchable letter by letter through an EdgeNGramFilterFactory and also sensitive to phrase queries?

For example, I'm looking for a field that, if it contains "contrat informatique", will be found when the user types:

  • contrat
  • informatique
  • contr
  • informa
  • "contrat informatique"
  • "contrat info"

Currently, I have something like this:

<fieldtype name="terms" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    </analyzer>
</fieldtype>

...but it fails on phrase queries.

When I look at the schema analyzer in the Solr admin, I find that "contrat informatique" generates the following tokens:

[...] contr contra contrat in inf info infor inform [...]

So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are separated).
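To make the position problem concrete, here is a minimal Python sketch that imitates the behavior described above. It is not Solr code: it assumes, based on the token list shown by the analyzer, that every gram gets its own consecutive position, and it splits on whitespace instead of using the real tokenizer. The function names are made up for the illustration.

```python
def edge_ngrams(word, min_gram=2, max_gram=15):
    """Front edge n-grams of a word, shortest first."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]

def index_positions(text):
    """Map each gram to its positions.

    Assumption: every gram takes the next position, so grams of the
    same word are NOT stacked on a single position.
    """
    positions = {}
    pos = 0
    for word in text.lower().split():
        for gram in edge_ngrams(word):
            pos += 1
            positions.setdefault(gram, []).append(pos)
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase terms occur at strictly consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )

index = index_positions("contrat informatique")
print(phrase_matches(index, "contrat in"))   # True: "in" directly follows "contrat"
print(phrase_matches(index, "contrat inf"))  # False: "in" sits between them
```

Under that assumption, "contrat" ends up at position 6, "in" at 7 and "inf" at 8, which is why only the first phrase matches.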

I'm pretty sure any kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.

Recommended answer

Since, alas, I could not get PositionFilter to work the way Jayendra Patil suggested (PositionFilter turns any query into an OR boolean query), I used a different approach.

Still using the EdgeNGramFilter, I made every keyword the user types mandatory and disabled all phrases.

So if the user asks for "cont info", it is transformed into +cont +info. It's a bit more permissive than a true phrase would be, but it does what I want (and doesn't return results that contain only one of the two terms).
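The answer does not say where this rewrite happens. One possibility is to do it on the client before the query is sent to Solr. The sketch below assumes that setup; `to_mandatory_terms` is a hypothetical helper, not part of Solr or any client library.

```python
import re

def to_mandatory_terms(user_input):
    """Turn raw user input into a query where every word is required.

    Only runs of word characters are kept, so double quotes are dropped
    and no phrase query is ever built.
    """
    words = re.findall(r"\w+", user_input)
    return " ".join("+" + word for word in words)

print(to_mandatory_terms('"cont info"'))  # +cont +info
```

Keeping only word characters also discards other query-syntax characters the user might type, which may or may not be wanted depending on the application.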

The only drawback of this workaround is that terms can be permuted in the results (so a document with "informatique contrat" will also be found), but that's not a big deal.
