How to match a phrase in Elasticsearch with an expandable prefix and suffix?
Question
We have a use case in which we want to match phrases in Elasticsearch, but in addition to a plain phrase query we also want to match partial phrases.
Example:
The search phrases "welcome you", "lcome you", "welcome yo", or "lcome yo" should all match documents containing these phrases:
welcome you
we welcome you
welcome you to
we welcome you to
That is, we want to preserve word order, as a phrase query does, while also returning results that contain the phrase as a partial substring, with the prefix and suffix expandable up to some configurable length. In Elasticsearch I found something similar, match_phrase_prefix, but it only matches phrases that start with a particular prefix.
For example, this returns results starting with the prefix d:
curl -XGET 'localhost:9200/startswith/test/_search?pretty' -d '{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "d",
        "max_expansions": 5
      }
    }
  }
}'
Is there any way I could achieve this for the suffix as well?
Recommended answer
I would strongly encourage you to look into the shingle token filter.
You can define an index with a custom analyzer that uses shingles to index runs of consecutive tokens together, in addition to the individual tokens themselves.
curl -XPUT localhost:9200/startswith -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_shingles": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingles"
          ]
        }
      },
      "filter": {
        "shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_shingles"
        }
      }
    }
  }
}'
For instance, "we welcome you to" would be indexed as the following tokens:
we
we welcome
welcome
welcome you
you
you to
to
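To make the token list above easier to reproduce, here is a minimal Python sketch that imitates the analysis chain outside Elasticsearch. It is only an approximation: the function name `shingle_tokens` is hypothetical, and whitespace splitting stands in for the standard tokenizer, which is an assumption that holds only for simple text without punctuation.

```python
def shingle_tokens(text, size=2, output_unigrams=True):
    """Approximate a lowercase + shingle analysis chain in plain Python."""
    # Assumption: whitespace splitting is a stand-in for the standard tokenizer.
    words = text.lower().split()
    tokens = []
    for i, word in enumerate(words):
        if output_unigrams:
            # Mirrors "output_unigrams": true, which keeps the single words.
            tokens.append(word)
        # Emit a shingle only when `size` consecutive words are available.
        if i + size <= len(words):
            tokens.append(" ".join(words[i:i + size]))
    return tokens


print(shingle_tokens("we welcome you to"))
```

With `size=2` this corresponds to setting both `min_shingle_size` and `max_shingle_size` to 2, so each position contributes the word itself plus the two-word shingle that starts there.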
Then you can index a few sample documents:
curl -XPUT localhost:9200/startswith/test/_bulk -d '
{"index": {}}
{"title": "welcome you"}
{"index": {}}
{"title": "we welcome you"}
{"index": {}}
{"title": "welcome you to"}
{"index": {}}
{"title": "we welcome you to"}
'
Finally, you can run the following query, which matches all four documents above:
curl -XPOST localhost:9200/startswith/test/_search -d '{
  "query": {
    "match": {"title": "welcome you"}
  }
}'
Note that this approach is more powerful than the match_phrase_prefix query, because it lets you match consecutive tokens anywhere in the body of text, whether at the beginning or the end.
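As a rough illustration of why all four sample titles match, the following Python sketch models a match query as "the analyzed query shares at least one token with the analyzed field" (the default OR operator). It ignores scoring entirely, the helper names `shingle_tokens` and `match_any` are hypothetical, and whitespace splitting is assumed in place of the standard tokenizer.

```python
def shingle_tokens(text, size=2):
    """Return the set of unigrams and `size`-word shingles for `text`."""
    # Assumption: whitespace splitting approximates the standard tokenizer.
    words = text.lower().split()
    tokens = set(words)
    tokens.update(
        " ".join(words[i:i + size]) for i in range(len(words) - size + 1)
    )
    return tokens


def match_any(query, title):
    """True if the analyzed query and title share at least one token."""
    return bool(shingle_tokens(query) & shingle_tokens(title))


titles = ["welcome you", "we welcome you", "welcome you to", "we welcome you to"]

# Every title contains the tokens "welcome", "you" and "welcome you".
print([t for t in titles if match_any("welcome you", t)])

# Shingles are built from whole words, so under this model a query made of
# word fragments shares no token with any title.
print(any(match_any("lcome yo", t) for t in titles))
```

Under this simplified model the word-fragment case from the question ("lcome yo") finds nothing, because shingles combine whole tokens rather than parts of words; covering that case would need character-level analysis on top of what is shown here.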