How to perform partial word searches on a text input with Elasticsearch?
Problem Description
I have a query to search for records in the following format: TR000002_1_2020.

Users should be able to search for results in the following ways:
TR000002
or 2_1_2020
or TR000002_1_2020
or 2020

I am using Elasticsearch 6.8, so I cannot use the built-in Search-As-You-Type introduced in ES 7. Thus, I figured either wildcard searches or ngram might best suit what I needed. Here were my two approaches and why they did not work.
- Wildcard

Property mapping:
.Text(t => t
.Name(tr => tr.TestRecordId)
)
Query:
m => m.Wildcard(w => w
.Field(tr => tr.TestRecordId)
.Value($"*{form.TestRecordId}*")
),
This works, but it is case-sensitive, so if the user searches with tr000002_1_2020, then no results are returned (since the t and r are lowercase in the query).
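The case-sensitivity problem is easy to see outside Elasticsearch. This is a rough Python sketch (not NEST); `fnmatchcase` stands in for the wildcard query being applied verbatim against the stored term:

```python
from fnmatch import fnmatchcase

# The term as stored in the index (original casing preserved)
term = "TR000002_1_2020"

# The wildcard pattern is compared character-for-character, so case matters.
print(fnmatchcase(term, "*TR000002*"))  # True
print(fnmatchcase(term, "*tr000002*"))  # False: 't'/'r' don't match 'T'/'R'
```

A lowercase normalizer (or a `keyword` subfield with a lowercase normalizer) on the field, combined with lowercasing the query input, is the usual way around this.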
- ngram (search-as-you-type equivalent)

Create a custom ngram analyzer:
.Analysis(a => a
    .Analyzers(aa => aa
        .Custom("autocomplete", ca => ca
            .Tokenizer("autocomplete")
            .Filters(new string[] {
                "lowercase"
            })
        )
        .Custom("autocomplete_search", ca => ca
            .Tokenizer("lowercase")
        )
    )
    .Tokenizers(t => t
        .NGram("autocomplete", e => e
            .MinGram(2)
            .MaxGram(16)
            .TokenChars(new TokenChar[] {
                TokenChar.Letter,
                TokenChar.Digit,
                TokenChar.Punctuation,
                TokenChar.Symbol
            })
        )
    )
)
Property mapping:
.Text(t => t
.Name(tr => tr.TestRecordId)
.Analyzer("autocomplete")
.SearchAnalyzer("autocomplete_search")
)
Query:
m => m.Match(ma => ma
    .Field(tr => tr.TestRecordId)
    .Query(form.TestRecordId)
),
As described in this answer, this does not work, since the tokenizer splits the characters up into tokens like 20, 02, and 2020, so as a result my queries returned all documents in my index that contained 2020, such as TR000002_1_2020, TR000008_1_2020, and TR000003_6_2020.
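The over-matching can be reproduced without a cluster. This hypothetical Python helper mimics what an ngram tokenizer plus lowercase filter emits, and shows that a query for 2020 shares grams with every record ending in _2020:

```python
def ngrams(text, min_gram=2, max_gram=16):
    """All substrings of length min_gram..max_gram, lowercased."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

query_grams = ngrams("2020")
for doc in ["TR000002_1_2020", "TR000008_1_2020", "TR000003_6_2020"]:
    # Any shared gram ("20", "02", "2020", ...) lets a match query hit.
    print(doc, bool(query_grams & ngrams(doc)))  # each prints True
```

This is only a sketch of the tokenization, not of Lucene's scoring, but it illustrates why every document containing "2020" came back.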
What's the best way to use Elasticsearch to get my desired search behavior? I've seen query string used as well. Thanks!
Recommended Answer
Here is a simple way to address your requirements (I hope):
- We use a pattern_replace char filter to remove the fixed part of the reference (TR000...)
- We use a split tokenizer to split the reference on the "_" character
- We use a match_phrase query to ensure that the fragments of the reference are matched in order
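The three steps above can be sketched in Python (a rough stand-in for the real analyzer, assuming the `^TR0*` pattern and the "_" split used in the mapping below):

```python
import re

def split_analyzer(text):
    # 1. pattern_replace char filter: strip the fixed "TR000..." prefix
    text = re.sub(r"^TR0*", "", text)
    # 2. simple_pattern_split tokenizer: split on "_"
    # 3. lowercase token filter
    return [token.lower() for token in text.split("_") if token]

print(split_analyzer("TR000002_1_2020"))  # ['2', '1', '2020']
```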
With this analysis chain, for the reference TR000002_1_2020 we get the tokens ["2", "1", "2020"]. So it will match the queries ["TR000002_1_2020", "TR000002 1 2020", "2_1_2020", "1_2020"], but it will not match 3_1_2020 or 2_2_2020.
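The match/no-match claims can be checked with a toy phrase matcher: the query tokens must appear as a contiguous run, in order, within the document tokens. This is a self-contained Python sketch of that idea, not the real Lucene implementation:

```python
import re

def analyze(text):
    # pattern_replace (^TR0*) + split on "_" + lowercase, as in the mapping
    return [t.lower() for t in re.sub(r"^TR0*", "", text).split("_") if t]

def match_phrase(doc, query):
    d, q = analyze(doc), analyze(query)
    # Phrase match: query tokens appear contiguously and in order.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

doc = "TR000002_1_2020"
for q in ["TR000002_1_2020", "2_1_2020", "1_2020", "3_1_2020", "2_2_2020"]:
    print(q, match_phrase(doc, q))
```

The first three queries print True and the last two print False, matching the behavior described above (the space-separated form is left out here, since this sketch only splits on "_").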
Here is an example of the mapping and analysis. It's not in NEST, but I think you will be able to make the translation.
PUT pattern_split_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "replace_char_filter": {
          "type": "pattern_replace",
          "pattern": "^TR0*",
          "replacement": ""
        }
      },
      "tokenizer": {
        "split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      },
      "analyzer": {
        "split_analyzer": {
          "tokenizer": "split_tokenizer",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "replace_char_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "split_analyzer"
      }
    }
  }
}

POST pattern_split_demo/_analyze
{
  "text": "TR000002_1_2020",
  "analyzer": "split_analyzer"
}
=> ["2", "1", "2020"]

POST pattern_split_demo/_doc?refresh=true
{
  "content": "TR000002_1_2020"
}

POST pattern_split_demo/_search
{
  "query": {
    "match_phrase": {
      "content": "TR000002_1_2020"
    }
  }
}