Solr: exact phrase query with an EdgeNGramFilterFactory
Question
In Solr (3.3), is it possible to make a field searchable letter by letter through an EdgeNGramFilterFactory and also sensitive to phrase queries?
For example, I'm looking for a field that, if it contains "contrat informatique", will be found when the user types any of:
- contrat
- informatique
- contr
- informa
- "contrat informatique"
- "contrat info"
Currently, I have something like this:
<fieldtype name="terms" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldtype>
...but it fails on phrase queries.
When I look at the schema analyzer in the Solr admin, I see that "contrat informatique" generates the following tokens:
[...] contr contra contrat in inf info infor inform [...]
So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are not adjacent).
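The position problem described above can be reproduced outside Solr. The following is only a rough sketch, not Solr's actual implementation: it assumes each gram is assigned its own consecutive position (which is what the analyzer output in the question suggests), uses a plain whitespace split in place of the LowerCaseTokenizer, and the function names are made up for illustration.

```python
def edge_ngrams(word, min_size=2, max_size=15):
    """Front edge n-grams of a word, shortest first."""
    longest = min(len(word), max_size)
    return [word[:n] for n in range(min_size, longest + 1)]


def index_tokens(text):
    """Flatten all grams into one list; the list index stands in for the
    token position (assumption: one position per gram)."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word))
    return tokens


def phrase_matches(tokens, phrase):
    """A phrase matches only if its terms sit at consecutive positions.
    Query terms are not n-grammed, as in the query analyzer above."""
    terms = phrase.lower().split()
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


tokens = index_tokens("contrat informatique")
print(phrase_matches(tokens, "contrat in"))   # True: "contrat" and "in" are adjacent
print(phrase_matches(tokens, "contrat inf"))  # False: "in" sits between them
```

Under this model, any gram of the second word longer than the shortest one is pushed away from "contrat", which is why only the two-letter prefix works inside a phrase.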
I'm pretty sure some kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.
Recommended answer
Alas, I could not get PositionFilter to work the way Jayendra Patil suggested (PositionFilter turns any query into an OR boolean query), so I used a different approach.
Still using the EdgeNGramFilter, I made every keyword the user types mandatory and disabled all phrases.
So if the user asks for "cont info", it is transformed into +cont +info. It's a bit more permissive than a true phrase would be, but it does what I want (and doesn't return results containing only one of the two terms).
The only drawback of this workaround is that terms can be permuted in the results (so a document with "informatique contrat" will also be found), but that's not a big deal.
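The rewrite from "cont info" to +cont +info can be done on the client before the query is sent to Solr. The answer does not show its code, so the following is only a minimal sketch of one way to do it: the function name is hypothetical, and it assumes the input contains no other query-syntax characters that would need escaping.

```python
def to_mandatory_terms(user_query):
    """Drop phrase quotes and mark every keyword as required with '+'."""
    # Replacing quotes with spaces disables phrase queries entirely.
    terms = user_query.replace('"', " ").split()
    return " ".join("+" + term for term in terms)


print(to_mandatory_terms('"cont info"'))  # +cont +info
```

Because the quotes are discarded, word order is no longer enforced, which is the source of the permutation caveat mentioned above.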