Elasticsearch query_string: handling special characters


Problem description

My database is synced with Elasticsearch to improve our search results and make requests faster.

I have an issue querying users. I want to look them up with a single search term, which can be part of a name, phone number, IP address, and so on.

My current query is:

query_string: { fields: ['id', 'email', 'firstName', 'lastName', 'phone', 'ip'], query: `*${escapeElastic(req.query.search.toString().toLowerCase())}*`}

Here req.query.search is the search input, and escapeElastic comes from the Node module elasticsearch-sanitize, which I added because some symbols were causing problems.

For example, if I search for an IPv6 address, the query becomes query: '*2001\\:0db8*', which finds nothing in the index even though it should.

Similarly, for a user whose firstName is john-doe, the query becomes query: '*john\\-doe*' and returns no results.

It seems the escaping prevents query errors but creates other problems in my case.

I do not know whether query_string is the best way to do this, and I am open to suggestions for optimizing the query.

Thanks.

Recommended answer

I suspect the analyzer on your fields is standard or similar. This means characters such as : and - are stripped during analysis:

GET _analyze
{
  "text": "John-Doe",
  "analyzer": "standard"
}

returns:

{
  "tokens" : [
    {
      "token" : "john",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "doe",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}


Let's create our own analyzer that keeps the special characters while lowercasing everything else:

PUT multisearch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "with_special_chars": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "fields": {
          "with_special_chars": {
            "type": "text",
            "analyzer": "with_special_chars"
          }
        }
      },
      "ip": {
        "type": "ip",
        "fields": {
          "with_special_chars": {
            "type": "text",
            "analyzer": "with_special_chars"
          }
        }
      }
    }
  }
}
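To make the effect of this analyzer concrete, here is a rough JavaScript emulation of a whitespace tokenizer followed by a lowercase filter. It is only an approximation for illustration: the function name analyzeWithSpecialChars is made up, and real Elasticsearch analysis handles more edge cases. Running _analyze against the multisearch index with "analyzer": "with_special_chars" remains the authoritative check.

```javascript
// Rough emulation of the "with_special_chars" analyzer defined above:
// split on whitespace only, then lowercase each token.
// Hypothetical helper for illustration; not part of any Elasticsearch client.
function analyzeWithSpecialChars(text) {
  return text
    .split(/\s+/)                         // whitespace tokenizer: break on runs of whitespace
    .filter((token) => token.length > 0)  // drop empty strings from leading/trailing spaces
    .map((token) => token.toLowerCase()); // lowercase token filter
}

console.log(analyzeWithSpecialChars('John-Doe'));
console.log(analyzeWithSpecialChars('2001:0db8:85a3:0000:0000:8a2e:0370:7334'));
```

Because the hyphen and the colons stay inside a single token, a prefix such as john- or 2001:0db8 has something to match against, which is not the case once the standard analyzer has split the value apart.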

Index two sample documents:

POST multisearch/_doc
{
  "ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
}

POST multisearch/_doc
{
   "firstName": "John-Doe"
}

Then run the query from above:

GET multisearch/_search
{
  "query": {
    "query_string": {
      "fields": [
        "id",
        "email",
        "firstName.with_special_chars",
        "lastName",
        "phone",
        "ip.with_special_chars"
      ],
      "query": "2001\\:0db8* OR john-*"
    }
  }
}

Both documents are returned.

Two remarks: 1) we are searching the .with_special_chars subfields instead of the main fields, and 2) I removed the leading wildcard from the IP term, because leading wildcards are highly inefficient.
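On the Node side, the same request could be assembled roughly as follows. This is a sketch under assumptions, not the elasticsearch-sanitize implementation: escapeQueryString and buildQueryStringQuery are hypothetical names, and the escaped character set follows the reserved characters listed in the query_string documentation (< and > cannot be escaped, so they are removed).

```javascript
// Fields from the query above; the ".with_special_chars" subfields must exist in the mapping.
const SEARCH_FIELDS = [
  'id',
  'email',
  'firstName.with_special_chars',
  'lastName',
  'phone',
  'ip.with_special_chars',
];

// Hypothetical stand-in for an escaping helper such as escapeElastic.
function escapeQueryString(input) {
  return input
    .replace(/[<>]/g, '') // < and > cannot be escaped in query_string, so drop them
    // Backslash-escape the remaining reserved characters. Single & and | are
    // escaped as well, which is more than strictly required but harmless.
    .replace(/[+\-=&|!(){}\[\]^"~*?:\\\/]/g, '\\$&');
}

// Build a query_string clause with a trailing wildcard only (no leading "*").
function buildQueryStringQuery(search) {
  const term = escapeQueryString(String(search).trim().toLowerCase());
  return {
    query_string: {
      fields: SEARCH_FIELDS,
      query: `${term}*`,
    },
  };
}

// JSON.stringify shows the doubled backslash seen in the console request.
console.log(JSON.stringify(buildQueryStringQuery('2001:0db8')));
```

An empty search term would produce the query "*", which matches everything, so the caller probably wants to reject empty input before building the request.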

A final tip, since you asked for optimization suggestions: the query could be rewritten as

GET multisearch/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id": "tegO63EBG_KW3EFnvQF8"
          }
        },
        {
          "match": {
            "email": "john@doe.com"
          }
        },
        {
          "match_phrase_prefix": {
            "firstName.with_special_chars": "john-d"
          }
        },
        {
          "match_phrase_prefix": {
            "lastName.with_special_chars": "doe"
          }
        },
        {
          "match": {
            "phone.with_special_chars": "+151351"
          }
        },
        {
          "wildcard": {
            "ip.with_special_chars": {
              "value": "2001\\:0db8*"
            }
          }
        }
      ]
    }
  }
}

  1. Partial id matching is probably overkill: either the term query catches the id or it does not.
  2. email can simply be matched.
  3. firstName and lastName: I suspect match_phrase_prefix performs better than wildcard or regexp, so I would go with that (as long as you do not need the leading *).
  4. phone can use match, but make sure special characters can be matched too (if you use the international format). That requires a with_special_chars subfield on every field queried this way; the mapping above defines it only for firstName and ip.
  5. Use wildcard for the ip, with the same syntax as in the query string.

Try the above and see if you notice any speed improvements!
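For a single search box, the bool query above could be generated from one term along these lines. This is a minimal sketch: buildUserSearchQuery and escapeWildcard are hypothetical names, and lastName.with_special_chars and phone.with_special_chars are assumed to have been added to the mapping. As far as I know, a wildcard query treats only *, ? and the backslash specially, so the colon is left unescaped here; escaping it as in the console example is also accepted.

```javascript
// Escape the characters that are special inside a wildcard query value: \, * and ?.
function escapeWildcard(input) {
  return input.replace(/[\\*?]/g, '\\$&');
}

// Build the bool/should query from a single user-supplied search term.
// Field names follow the example above; the with_special_chars subfields
// on lastName and phone are assumptions and must exist in the mapping.
function buildUserSearchQuery(search) {
  const raw = String(search).trim();
  const term = raw.toLowerCase();
  return {
    bool: {
      should: [
        { term: { id: raw } },      // exact id match only, case preserved
        { match: { email: term } },
        { match_phrase_prefix: { 'firstName.with_special_chars': term } },
        { match_phrase_prefix: { 'lastName.with_special_chars': term } },
        { match: { 'phone.with_special_chars': term } },
        // trailing wildcard only; a leading "*" would be very slow
        { wildcard: { 'ip.with_special_chars': { value: `${escapeWildcard(term)}*` } } },
      ],
    },
  };
}

console.log(JSON.stringify(buildUserSearchQuery('2001:0db8'), null, 2));
```

The returned object is meant to be placed under "query" in the search request body, for example { query: buildUserSearchQuery(req.query.search) }.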
