Elasticsearch using NEST: How to configure analyzers to find partial words?
Problem Description
I am trying to search by partial word, ignoring casing and ignoring accents on some letters. Is it possible? I think nGram with the default tokenizer should do the trick, but I don't understand how to do it with NEST.
Example: "musiic" should match records that have "music"
The Elasticsearch version I am using is 1.9.
I am doing it like this, but it doesn't work...
var ix = new IndexSettings();
ix.Add("analysis",
    @"{
        'index_analyzer' : {
            'my_index_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['lowercase', 'mynGram']
            }
        },
        'search_analyzer' : {
            'my_search_analyzer' : {
                'type' : 'custom',
                'tokenizer' : 'standard',
                'filter' : ['standard', 'lowercase', 'mynGram']
            }
        },
        'filter' : {
            'mynGram' : {
                'type' : 'nGram',
                'min_gram' : 2,
                'max_gram' : 50
            }
        }
    }");
client.CreateIndex("sample", ix);
Thanks,
David
Recommended Answer
Short answer

I think what you're looking for is a fuzzy query (http://www.elasticsearch.org/guide/reference/query-dsl/fuzzy-query.html), which uses the Levenshtein distance algorithm to match similar words.
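To see why a fuzzy query would match "musiic" to "music", here is a minimal sketch of the Levenshtein edit distance in Python. This is my own illustration, not part of the original answer; Elasticsearch implements this internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "musiic" is one edit (delete an 'i') away from "music",
# so a fuzzy query allowing at least one edit would match.
print(levenshtein("musiic", "music"))  # → 1
```

A fuzzy query compares the query term against indexed terms within a maximum edit distance, which is exactly what the nGram setup below cannot do for this typo.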
Long answer on nGrams
The nGram filter splits the text into many smaller tokens based on the defined min/max range.
For example, from your 'music' query the filter will generate:
'mu', 'us', 'si', 'ic', 'mus', 'usi', 'sic', 'musi', 'usic', and 'music'
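A quick way to see these tokens is to generate them directly. This short sketch (my own illustration, not Elasticsearch's actual tokenizer code) mimics what the nGram filter produces for min_gram=2, max_gram=50:

```python
def ngrams(text: str, min_gram: int = 2, max_gram: int = 50) -> list[str]:
    """Generate all substrings of length min_gram..max_gram, shortest first."""
    return [text[i:i + n]
            for n in range(min_gram, min(max_gram, len(text)) + 1)
            for i in range(len(text) - n + 1)]

print(ngrams("music"))
# → ['mu', 'us', 'si', 'ic', 'mus', 'usi', 'sic', 'musi', 'usic', 'music']
```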
As you can see, "musiic" does not match any of these nGram tokens.
Why nGrams?
One benefit of nGrams is that they make wildcard queries significantly faster, because all potential substrings are pre-generated and indexed at insert time (I have seen queries speed up from multiple seconds to 15 milliseconds using nGrams).
Without nGrams, each string must be scanned at query time for a match [O(n^2)] instead of looked up directly in the index [O(1)]. As pseudocode:
hits = []
foreach string in index:
    if string.substring(query):
        hits.add(string)
return hits

vs.

return index[query]
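The contrast above can be made concrete with a toy example. This is my own sketch, assuming a plain Python dict stands in for the inverted index: the slow path scans every document at query time, while the fast path precomputes nGrams at insert time so the query becomes a single lookup.

```python
def ngrams(text, min_gram=2, max_gram=50):
    """All substrings of length min_gram..max_gram, as a set."""
    return {text[i:i + n]
            for n in range(min_gram, min(max_gram, len(text)) + 1)
            for i in range(len(text) - n + 1)}

docs = ["music", "museum", "mask"]

# Slow path: scan every document at query time (the loop above).
def scan(query):
    return [d for d in docs if query in d]

# Fast path: build an inverted index of nGrams once, at insert time.
index = {}
for d in docs:
    for gram in ngrams(d):
        index.setdefault(gram, set()).add(d)

def lookup(query):
    return sorted(index.get(query, set()))

print(scan("mus"))    # → ['music', 'museum']
print(lookup("mus"))  # → ['museum', 'music']
```

The trade-off is visible even here: the index dict holds every nGram of every document, which is why the note below about slower inserts and heavier storage applies.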
Note that this comes at the cost of slower inserts, more storage, and heavier memory usage.