Lucene NGram tokenizer with QueryParser
Problem description
I've created a custom trigram analyzer for fuzzy matching in my project (NGramTokenizer(Version.LUCENE_44, reader, 3, 3)), specifying a minimum token size of 3 and a maximum of 3.
At index time I get the proper trigram tokens, but when I use the same analyzer at query time (through QueryParser), it skips tokens shorter than 3 characters.
Example
Indexed document - Hi Rushik
Indexed trigrams - hi_, i_r, rus, ush, shi, hik (checked using the Luke index reader)
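To see where those index-time trigrams come from, here is a minimal sliding-window sketch (plain Java, not actual Lucene code): a tokenizer with minGram = maxGram = 3 slides a fixed 3-character window over the whole field value, whitespace included, which is why grams such as "i r" that straddle the space appear in the index. The class and method names are illustrative, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

public class TrigramSketch {
    // Simplified model of an NGramTokenizer with min=3, max=3:
    // slide a 3-char window over the entire (lowercased) input,
    // including the whitespace between words.
    static List<String> trigrams(String text) {
        String s = text.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Emits grams over "hi rushik", including ones that cross the space
        System.out.println(trigrams("Hi Rushik"));
    }
}
```

The exact gram set a real Lucene tokenizer emits can differ slightly by version, but the key point is that at index time the window runs over the whole field value, not over individual words.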
Query - Hi Rushik AB XYZ.
Parsed query (QueryParser result) - (name_data:rus name_data:ush name_data:shi name_data:hik) name_data:xyz
As you can see, the query parser removed tokens shorter than 3 characters. I understand I specified (3, 3) when tokenizing, but in that case shouldn't indexing also have skipped tokens shorter than 3 characters?
I think I'm missing something here - any help?
Answer
Found the answer.
Lucene's QueryParser first tokenizes the input on whitespace and only then runs the analyzer on each individual term/token. Since my analyzer is NGram(3, 3), it cannot generate any token from a 2-character term.
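The behavior can be sketched in plain Java (this is an illustration of the split-then-analyze order, not Lucene code): the query is split on whitespace first, then each term is 3-grammed on its own, so 2-character terms like "Hi" and "AB" produce no clauses at all.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryParserSketch {
    // Model of the QueryParser behavior described above:
    // split the query on whitespace FIRST, then run the 3-gram
    // analysis on each term separately. Terms shorter than 3
    // characters contribute nothing to the parsed query.
    static List<String> parseTerms(String query) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            for (int i = 0; i + 3 <= term.length(); i++) {
                clauses.add(term.substring(i, i + 3)); // no output for 1- or 2-char terms
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // "hi" and "ab" vanish; only "rushik" and "xyz" yield clauses,
        // matching the parsed query shown in the question.
        System.out.println(parseTerms("Hi Rushik AB XYZ"));
    }
}
```

This contrasts with index time, where the tokenizer sees the entire field value as one stream and can emit trigrams that span the whitespace between words.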