Elasticsearch query_string: handling special characters
Question
My database is synced with Elasticsearch to improve our search results and make requests faster.
I have an issue querying users: I want to look up users with a single search term, which can be part of a name, phone number, IP address, ...
My current query is:
query_string: { fields: ['id', 'email', 'firstName', 'lastName', 'phone', 'ip'], query: `*${escapeElastic(req.query.search.toString().toLowerCase())}*`}
where req.query.search is my search term and escapeElastic comes from the Node module elasticsearch-sanitize, which I added because I had issues with some symbols.
For example, if I search for an IPv6 address, the query becomes query: '*2001\\:0db8*', but it does not find anything in the database, although it should.
Another issue: if someone has the firstName john-doe, my query becomes query: '*john\\-doe*' and it does not find any results.
It seems that the escaping prevents query errors but creates other issues in my case.
I do not know whether query_string is the best way to handle my request; I am open to suggestions for optimizing this query.
Thanks
Recommended answer
I suspect the analyzer on your fields is standard or similar. This means characters like : and - are stripped at index time:
GET _analyze
{
"text": "John-Doe",
"analyzer": "standard"
}
which returns
{
"tokens" : [
{
"token" : "john",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "doe",
"start_offset" : 5,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Let's create our own analyzer that keeps the special chars while lowercasing all other chars:
PUT multisearch
{
"settings": {
"analysis": {
"analyzer": {
"with_special_chars": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"firstName": {
"type": "text",
"fields": {
"with_special_chars": {
"type": "text",
"analyzer": "with_special_chars"
}
}
},
"ip": {
"type": "ip",
"fields": {
"with_special_chars": {
"type": "text",
"analyzer": "with_special_chars"
}
}
}
}
}
}
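To see why the sub-field helps, here is a rough plain-JavaScript approximation of the two analyzers. It is only a sketch: the function names are made up for illustration, the "standard" version is an ASCII-only simplification of the real tokenizer, and the matcher only mimics the fact that a wildcard term has to match a single indexed token as a whole.

```javascript
// ASCII-only approximation of the standard analyzer: lowercase, split on anything non-alphanumeric.
function approxStandardTokens(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Approximation of the custom analyzer: whitespace tokenizer + lowercase filter.
function whitespaceLowercaseTokens(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Turn a pattern with * wildcards into an anchored regular expression.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\\\*/g, ".*") + "$");
}

// A wildcard term matches only if one single token matches the whole pattern.
function anyTokenMatches(tokens, pattern) {
  const re = wildcardToRegExp(pattern);
  return tokens.some((token) => re.test(token));
}

const name = "John-Doe";
console.log(approxStandardTokens(name), anyTokenMatches(approxStandardTokens(name), "*john-doe*"));
console.log(whitespaceLowercaseTokens(name), anyTokenMatches(whitespaceLowercaseTokens(name), "john-*"));
```

With the standard-style tokens no single token contains the hyphen, so the escaped pattern has nothing to match; with the whitespace-style tokens the value is kept in one piece.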
Index two sample docs:
POST multisearch/_doc
{
"ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
}
POST multisearch/_doc
{
"firstName": "John-Doe"
}
and apply your query from above:
GET multisearch/_search
{
"query": {
"query_string": {
"fields": [
"id",
"email",
"firstName.with_special_chars",
"lastName",
"phone",
"ip.with_special_chars"
],
"query": "2001\\:0db8* OR john-*"
}
}
}
Both hits are returned.
Two remarks: 1) note that we are searching the .with_special_chars sub-fields instead of the main fields, and 2) I've removed the leading wildcard from the ip query because leading wildcards are highly inefficient.
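On the Node side, the query above could be assembled roughly as follows. This is a minimal sketch, not the elasticsearch-sanitize implementation: escapeQueryString and buildUserSearch are hypothetical helpers, the field list is taken from the question, and the choice to fall back to match_all for an empty term is an assumption.

```javascript
// Reserved query_string characters that can be backslash-escaped.
const RESERVED = /[+\-=!(){}\[\]^"~*?:\\\/&|]/g;

// Hypothetical helper: escape user input for use inside a query_string query.
function escapeQueryString(term) {
  return term
    .replace(/[<>]/g, "") // < and > cannot be escaped, so they are dropped
    .replace(RESERVED, "\\$&");
}

// Hypothetical helper: build the query with a trailing wildcard only.
function buildUserSearch(rawTerm) {
  const term = String(rawTerm).trim().toLowerCase();
  if (term === "") {
    return { match_all: {} }; // assumption: an empty search lists everything
  }
  return {
    query_string: {
      fields: [
        "id",
        "email",
        "firstName.with_special_chars",
        "lastName",
        "phone",
        "ip.with_special_chars",
      ],
      query: `${escapeQueryString(term)}*`,
    },
  };
}

console.log(JSON.stringify(buildUserSearch("2001:0db8"), null, 2));
```

Escaping is still needed so that : or - is not parsed as an operator; the difference from the original query is that the escaped term now runs against fields whose tokens actually contain those characters.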
Final tips, since you asked for optimization suggestions: the query could be rewritten as
GET multisearch/_search
{
"query": {
"bool": {
"should": [
{
"term": {
"id": "tegO63EBG_KW3EFnvQF8"
}
},
{
"match": {
"email": "john@doe.com"
}
},
{
"match_phrase_prefix": {
"firstName.with_special_chars": "john-d"
}
},
{
"match_phrase_prefix": {
"lastName.with_special_chars": "doe"
}
},
{
"match": {
"phone.with_special_chars": "+151351"
}
},
{
"wildcard": {
"ip.with_special_chars": {
"value": "2001\\:0db8*"
}
}
}
]
}
}
}
- Partial id matching is probably overkill: either the term query catches it or it doesn't.
- email can simply be matched.
- first- & lastName: I suspect match_phrase_prefix is more performant than wildcard or regexp, so I'd go with that (as long as you don't need the leading *).
- phone can be matched, but do make sure special chars can be matched too (if you use the international format).
- Use wildcard for the ip, with the same syntax as in the query string.
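The clause-per-field approach in the list above could be generated from a single search term along these lines. Treat it as a sketch under assumptions: buildBoolSearch and escapeWildcard are hypothetical helpers, and the lastName.with_special_chars and phone.with_special_chars sub-fields are assumed to be mapped the same way as the firstName and ip ones.

```javascript
// In a wildcard query only \, * and ? are special, so only those are escaped here.
function escapeWildcard(term) {
  return term.replace(/[\\*?]/g, "\\$&");
}

// Hypothetical helper: one should-clause per field, each using the query type suggested above.
function buildBoolSearch(rawTerm) {
  const exact = String(rawTerm).trim();
  const term = exact.toLowerCase();
  return {
    bool: {
      should: [
        // ids are matched exactly, without lowercasing
        { term: { id: exact } },
        { match: { email: term } },
        { match_phrase_prefix: { "firstName.with_special_chars": term } },
        // assumed sub-field, not part of the mapping shown earlier
        { match_phrase_prefix: { "lastName.with_special_chars": term } },
        // assumed sub-field, not part of the mapping shown earlier
        { match: { "phone.with_special_chars": term } },
        // trailing wildcard only; a leading one would be expensive
        { wildcard: { "ip.with_special_chars": { value: `${escapeWildcard(term)}*` } } },
      ],
    },
  };
}

console.log(JSON.stringify(buildBoolSearch("2001:0db8"), null, 2));
```

The returned object is meant to be used as the query part of the search request body; a document matching any one of the clauses is returned.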
Try the above and see if you notice any speed improvements!