Solr: exact phrase query with an EdgeNGramFilterFactory


Question

In Solr (3.3), is it possible to make a field searchable letter by letter through an EdgeNGramFilterFactory and also sensitive to phrase queries?

For example, I'm looking for a field that, if it contains "contrat informatique", will be found when the user types:

  • contrat
  • informatique
  • contr
  • informa
  • "contrat informatique"
  • "contrat info"

Currently, I have something like this:

<fieldtype name="terms" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldtype>

...but it fails on phrase queries.

When I look at the schema analyzer in the Solr admin, I find that "contrat informatique" generates the following tokens:

[...] contr contra contrat in inf info infor inform [...]

So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are separated).
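To make the position problem concrete, here is a minimal Python sketch of a toy index. It is not Solr code: it assumes, as the analyzer output above suggests, that every emitted gram gets its own position, and the helper names (`edge_ngrams`, `index_positions`, `phrase_matches`) are made up for illustration.

```python
def edge_ngrams(word, min_gram=2, max_gram=15):
    """Front edge n-grams of a word, shortest first."""
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]


def index_positions(text):
    """Toy index: map each gram to its positions.

    Assumption: every emitted gram advances the position by one,
    which is what the token listing in the question suggests.
    """
    positions = {}
    pos = 0
    for word in text.lower().split():
        for gram in edge_ngrams(word):
            positions.setdefault(gram, []).append(pos)
            pos += 1
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in positions for term in terms):
        return False
    return any(
        all(start + offset in positions[term] for offset, term in enumerate(terms))
        for start in positions[terms[0]]
    )


index = index_positions("contrat informatique")
print(phrase_matches(index, "contrat in"))   # True: "in" directly follows "contrat"
print(phrase_matches(index, "contrat inf"))  # False: "in" sits between the two grams
```

Under this model, a phrase only matches when the second term is the very first gram of the next word, which is the behaviour described above.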

I'm pretty sure any kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.

Recommended answer

Alas, I could not get a PositionFilter to work the way Jayendra Patil suggested (PositionFilter turns any query into an OR boolean query), so I used a different approach.

Still using the EdgeNGramFilter, I made every keyword the user types mandatory and disabled all phrases.

So if the user asks for "cont info", it is transformed to +cont +info. It's a bit more permissive than a true phrase would be, but it does what I want (and doesn't return results containing only one of the two terms).
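The rewriting step could look roughly like the following Python sketch. The answer does not say where or how the transformation is done, so this assumes it happens client-side before the query is sent to Solr; `to_mandatory_terms` is a hypothetical helper, and it does not escape Lucene special characters.

```python
def to_mandatory_terms(user_input):
    """Drop phrase quotes and mark every keyword as required with '+'.

    Hypothetical helper: special query characters are not escaped here.
    """
    keywords = user_input.replace('"', " ").split()
    return " ".join("+" + keyword for keyword in keywords)


print(to_mandatory_terms('"cont info"'))  # +cont +info
```

Because the quotes are removed, term order no longer matters to the query, which is where the extra permissiveness comes from.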

The only drawback of this workaround is that terms can be permuted in the results (so a document with "informatique contrat" will also be found), but that's not a big deal.
