Negative lookahead regexp doesn't work in ES DSL query


Problem description

The Elasticsearch mapping is as follows:

{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "node": {
      "properties": {
        "field1": {
          "type": "keyword"
        },
        "field2": {
          "type": "keyword"
        },
        "query": {
          "properties": {
            "regexp": {
              "properties": {
                "field1": {
                  "type": "keyword"
                },
                "field2": {
                  "type": "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}

The problem:

I am building ES queries with elasticsearch_dsl Q(). This works fine in most cases, even when the query contains a complex regexp, but it fails completely when the regexp contains the '!' character: the search returns no results at all.

For example:

1.) Q('regexp', field1 = "^[a-z]{3}.b.*") (works perfectly)

2.) Q('regexp', field1 = "^f04.*") (works perfectly)

3.) Q('regexp', field1 = "f00.*") (works perfectly)

4.) Q('regexp', field1 = "f04baz?") (works perfectly)

It fails in the case below:

5.) Q('regexp', field1 = "f04((?!z).)*") (fails, returning no results at all)

I tried adding "analyzer": "keyword" alongside "type": "keyword" in the field definitions above, but then nothing works at all.

In the browser, I checked how the keyword analyzer handles the input from the failing case:

http://localhost:9210/search/_analyze?analyzer=keyword&text=f04((?!z).)*

The result looks fine:

{
  "tokens": [
    {
      "token": "f04((?!z).)*",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 0
    }
  ]
}

I'm running my queries like this:

from elasticsearch_dsl import Search, Q

# _conn, _index, _type and logger are defined elsewhere in the application
search_obj = Search(using=_conn, index=_index, doc_type=_type).query(Q('regexp', field1="f04baz?"))
count = search_obj.count()
response = search_obj[0:count].execute()
logger.debug("total nodes(hits): %s", response.hits.total)

Please help. This is a really annoying problem, because every regex character works fine in every query except '!'.

Also, how do I check which analyzer is currently applied with the above settings in my mappings?

Solution

The Lucene regex engine that Elasticsearch uses does not support lookarounds of any kind. The ES regex documentation is rather ambiguous on this point: it says that matching everything with patterns like .* is very slow, as is using lookaround regular expressions (which is not only ambiguous but also wrong, since lookarounds, when used wisely, can greatly speed up regex matching).

Since you want to match any string that contains f04 and does not contain z, you can simply use

[^z]*f04[^z]*

Details

  • [^z]* - any 0+ chars other than z
  • f04 - the f04 substring
  • [^z]* - any 0+ chars other than z.
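
As a rough illustration of the pattern described above (with the literal f04 from the question), here is a minimal Python sketch. It relies on the fact that Lucene regexps must match the whole term, which Python's re.fullmatch approximates for this simple pattern. The helper names lucene_like_match and regexp_query are hypothetical, and the sample terms are made up for illustration; only the regexp query body shape comes from Elasticsearch itself.

```python
import re

# Pattern from the answer: non-z chars, the literal f04, then more non-z chars.
PATTERN = "[^z]*f04[^z]*"


def lucene_like_match(pattern, term):
    """Approximate Lucene's whole-term matching using re.fullmatch (assumption:
    the pattern only uses syntax that Python's re and Lucene share)."""
    return re.fullmatch(pattern, term) is not None


def regexp_query(field, pattern):
    """Build the plain-dict body of an Elasticsearch regexp query (hypothetical helper)."""
    return {"query": {"regexp": {field: pattern}}}


if __name__ == "__main__":
    # Made-up sample terms: the first two contain f04 and no z, the last two contain a z.
    for term in ["f04bar", "abcf04", "f04baz", "zf04"]:
        print(term, lucene_like_match(PATTERN, term))
    print(regexp_query("field1", PATTERN))
```

With elasticsearch_dsl, the same pattern would be passed as Q('regexp', field1="[^z]*f04[^z]*"), in the same way as the working queries in the question.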

If the string you want to "exclude" is longer than one character (say, z4 rather than z), you can use the complement operator instead:

.*f04.*&~(.*z4.*)

This means almost the same but does not support line breaks:

  • .* - any chars other than newline, as many as possible
  • f04 - f04
  • .* - any chars other than newline, as many as possible
  • & - AND
  • ~(.*z4.*) - any string other than the one having z4
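
The intersection-plus-complement logic above can be sketched in plain Python. Python's re module has no & or ~ operators, so this sketch emulates them with two separate whole-string matches combined with "and not". The function names are hypothetical. The "flags": "ALL" entry in the query body is an assumption about how the optional operators are enabled (it is the documented default in the Elasticsearch versions I am aware of); check the regexp query documentation for your version, since support for the complement operator may differ.

```python
import re

# Pattern from the answer, using Lucene's optional & (intersection) and ~ (complement).
LUCENE_PATTERN = ".*f04.*&~(.*z4.*)"


def contains_f04_without_z4(term):
    """Emulate the Lucene pattern: 'contains f04' AND NOT 'contains z4'."""
    has_f04 = re.fullmatch(r".*f04.*", term) is not None
    has_z4 = re.fullmatch(r".*z4.*", term) is not None
    return has_f04 and not has_z4


def complement_query(field):
    """Build a regexp query body with the optional operators enabled
    (hypothetical helper; 'flags': 'ALL' is an assumption, see the lead-in)."""
    return {"query": {"regexp": {field: {"value": LUCENE_PATTERN, "flags": "ALL"}}}}


if __name__ == "__main__":
    # Made-up sample terms for illustration.
    for term in ["f04bar", "f04baz", "f04z4", "z4f04", "bar"]:
        print(term, contains_f04_without_z4(term))
    print(complement_query("field1"))
```

Unlike the single-character case, a term such as f04baz is accepted here, because only the two-character sequence z4 is excluded.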
