Elasticsearch using NEST: How to configure analyzers to find partial words?


Question

I am trying to search by partial word, ignoring casing and ignoring the accentuation of some letters. Is it possible? I think nGram with the default tokenizer should do the trick, but I don't understand how to do it with NEST.

Example: "musiic" should match records that have "music".

The Elasticsearch version I am using is 1.9.

I am doing it like this, but it doesn't work:

var ix = new IndexSettings();
ix.Add("analysis",
    @"{
        'index_analyzer' : {
            'my_index_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase', 'mynGram']
            }
        },
        'search_analyzer' : {
            'my_search_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['standard', 'lowercase', 'mynGram']
            }
        },
        'filter' : {
            'mynGram' : {
                'type' : 'nGram',
                'min_gram' : 2,
                'max_gram' : 50
            }
        }
    }");
client.CreateIndex("sample", ix);



Thanks,

David

Answer

Short answer

I think what you're looking for is a fuzzy query (http://www.elasticsearch.org/guide/reference/query-dsl/fuzzy-query.html), which uses the Levenshtein distance algorithm to match similar words.
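
For illustration, a fuzzy search issued through NEST might look roughly like the sketch below. This is a minimal sketch, not a definitive implementation: the Song document type, the "title" field, and the exact fluent method names are assumptions based on the NEST 1.x syntax.

// Hedged sketch: Song, "title", and the fluent method names are assumptions.
var response = client.Search<Song>(s => s
    .Index("sample")
    .Query(q => q
        .Fuzzy(f => f
            .OnField("title")      // field to search
            .Value("musiic"))));   // misspelled input can still match "music"

A query like this tolerates a small number of character edits (insertions, deletions, substitutions) between the input and the indexed terms, which is exactly the difference between "musiic" and "music".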

Long answer: nGrams

The nGram filter splits the text into many smaller tokens based on the defined min/max range.

For example, from your 'music' query the filter will generate: 'mu', 'us', 'si', 'ic', 'mus', 'usi', 'sic', 'musi', 'usic', and 'music'.

As you can see, 'musiic' does not match any of these nGram tokens.
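
One way to see this directly is the _analyze API, which returns the tokens an analyzer emits for a given input. A rough sketch with NEST follows; the Analyze call shape is an assumption based on the 1.x client (it wraps GET /sample/_analyze):

// Hedged sketch: inspect what the custom analyzer actually produces.
// Method names are assumptions based on the NEST 1.x Analyze API.
var analysis = client.Analyze(a => a
    .Index("sample")                  // index where the analyzer is defined
    .Analyzer("my_index_analyzer")    // the custom analyzer from the question
    .Text("music"));

foreach (var token in analysis.Tokens)
    Console.WriteLine(token.Token);   // mu, us, si, ic, mus, usi, ...

Running the same call with "musiic" would show tokens like 'ii' and 'iic' that never appear in the index, which is why the nGram approach alone cannot bridge the misspelling.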

Why nGrams?

One benefit of nGrams is that they make wildcard queries significantly faster, because all potential substrings are pre-generated and indexed at insert time (I have seen queries speed up from multiple seconds to 15 milliseconds using nGrams).

Without nGrams, each string must be scanned at query time for a match [O(n^2)] instead of being looked up directly in the index [O(1)]. As pseudocode:

hits = []
for string in index:          # scan every indexed string
    if query in string:       # substring check at query time
        hits.append(string)
return hits

vs.

return index[query]           # direct lookup of the pre-generated nGram token

Note that this comes at the expense of slower inserts, more storage, and heavier memory usage.
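
As a final note on the settings in the question: Elasticsearch registers custom analyzers under a single 'analyzer' key inside the 'analysis' block, not under separate 'index_analyzer' and 'search_analyzer' containers. A minimal sketch of the same settings in that shape (names and values carried over from the question; assigning the index/search analyzer to a field is then done in the mapping):

var ix = new IndexSettings();
ix.Add("analysis",
    @"{
        'analyzer' : {
            'my_index_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase', 'mynGram']
            },
            'my_search_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['standard', 'lowercase', 'mynGram']
            }
        },
        'filter' : {
            'mynGram' : {
                'type' : 'nGram',
                'min_gram' : 2,
                'max_gram' : 50
            }
        }
    }");
client.CreateIndex("sample", ix);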
