Lucene NGram tokenizer with QueryParser


Problem description

I've created a custom trigram analyzer for fuzzy matching in my project (NGramTokenizer(Version.LUCENE_44, reader, 3, 3)), specifying a minimum and maximum token size of 3.

During indexing I get the proper trigram tokens, but when I use the same analyzer at query time (via QueryParser), it skips tokens shorter than 3 characters.

Example

Indexed document - Hi Rushik

Indexed tri-grams - hi_, i_r, rus, ush, shi, hik (checked using the Luke index reader)

Query - Hi Rushik AB XYZ

Parsed query (QueryParser result) - (name_data:rus name_data:ush name_data:shi name_data:hik) name_data:xyz

As you can see, the query parser removed tokens shorter than 3 characters. I understand I specified (3, 3) when tokenizing, but in that case shouldn't indexing also have skipped tokens shorter than 3 characters?
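For reference, the index-time behavior can be simulated in plain Java (no Lucene dependency). This sketch assumes NGramTokenizer slides a 3-character window over the whole lowercased character stream, spaces included, with `_` standing in for a space the way Luke displays it; the exact gram set may differ slightly by Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexTimeGrams {
    // Slide a 3-char window over the entire lowercased stream, spaces included,
    // which is how grams like "hi_" and "i_r" can appear in the index.
    static List<String> trigrams(String text) {
        String s = text.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3).replace(' ', '_'));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("Hi Rushik"));
        // [hi_, i_r, _ru, rus, ush, shi, hik]
    }
}
```

Note that this naive sketch also emits `_ru`, which the Luke listing above did not show; the key point is that index-time grams span whitespace because the tokenizer sees the whole field as one stream.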

I think I am missing something here. Any help?

Answer

Got the answer.

Lucene's QueryParser first tokenizes the input on whitespace and only then runs the analyzer on each individual term/token. Since my analyzer is NGram(3, 3), it cannot generate any token from a term of fewer than 3 characters.
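That two-stage behavior can be sketched in plain Java (hypothetical helper names, no Lucene dependency): split on whitespace first, then run the 3-gram analysis on each term in isolation, so a 2-character term like "Hi" or "AB" contributes nothing to the parsed query.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryTimeGrams {
    // Character 3-grams of a single term; a term shorter than 3 chars yields none.
    static List<String> trigrams(String term) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= term.length(); i++) {
            grams.add(term.substring(i, i + 3));
        }
        return grams;
    }

    // QueryParser-style: split on whitespace FIRST, then analyze each term alone,
    // so grams never span the gap between two query words.
    static List<String> parseQuery(String query) {
        List<String> out = new ArrayList<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            out.addAll(trigrams(term)); // "hi" and "ab" contribute nothing
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parseQuery("Hi Rushik AB XYZ"));
        // [rus, ush, shi, hik, xyz]
    }
}
```

This reproduces the parsed query shown above: only "rushik" and "xyz" survive, matching (name_data:rus name_data:ush name_data:shi name_data:hik) name_data:xyz.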

