正则表达式开始于无法使用Elasticsearch 6. * [英] Regexp starts with not working Elasticsearch 6.*

查看:76
本文介绍了正则表达式开始于无法使用Elasticsearch 6. *的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在理解ElasticSearch中的正则表达式机制时遇到了麻烦.我有代表物业单位的文件:

I got trouble with understanding regexp mechanizm in ElasticSearch. I have documents that represent property units:

{
    "Unit" :
    {
         "DailyAvailablity" : 
         "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    }
}

DailyAvailability域从今天起按天数编码未来两年的财产可用性. 'A'表示可用,'U'不可删除,'I'可以签入,'O'可以签出.如何编写正则表达式过滤器以获取特定日期可用的所有单位?

DailyAvailability field codes availability of property by days for the next two years from today. 'A' means available, 'U' unabailable, 'I' can check in, 'O' can check out. How can I write regexp filter to get all units that are available in particular dates?

我试图在DailyAvailability字段中找到具有特定长度和偏移量的'A'子字符串.例如,查找从今天起7天内有7天可用的广告单元:

I tried to find the 'A' substring with particular length and offset in DailyAvailability field. For example to find units that would be available for 7 days in 7 days from today:

{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
        }
      ]
    }
  }
}

此查询返回一个具有DateAvailability的实例单位,该实例单位从"UUUUUUUUUUUUUUUUUUUUIAA"开始,但在字段内部包含合适的序列.如何锚定整个源字符串的正则表达式? ES文档说,lucene regex应该默认锚定.

This query returns for instance unit with DateAvailability that starts from "UUUUUUUUUUUUUUUUUUUIAA", but contains suitable sequences somehere inside the field. How can I anchor regexp for entire source string? ES docs say that lucene regex should be anchored by default.

P.S.我已经尝试过'^.{7}a{7}.*$'.返回空集.

P.S. I have tried '^.{7}a{7}.*$'. Returns empty set.

推荐答案

您似乎正在使用 keyword 数据类型.

It looks like you are using text datatype to store Unit.DailyAvailability (which is also the default one for strings if you are using dynamic mapping). You should consider using keyword datatype instead.

让我详细解释一下.

使用text数据类型会发生的情况是对数据进行了分析以进行全文搜索.它进行了一些转换,例如降低大小写并拆分为令牌.

What happens with text datatype is that the data gets analyzed for full-text search. It does some transformations like lowercasing and splitting into tokens.

让我们尝试使用分析API 根据您的输入:

Let's try to use the Analyze API against your input:

POST _analyze
{
  "text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}

响应为:

{
  "tokens": [
    {
      "token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 255,
      "end_offset": 510,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 510,
      "end_offset": 732,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

如您所见,Elasticsearch将您的输入分为三个标记并将它们小写.这看起来是出乎意料的,但是如果您认为它实际上是在尝试简化人类语言中的单词搜索,那是有道理的-没有那么长的单词.

As you can see, Elasticsearch has split your input into three tokens and lowercased them. This looks unexpected, but if you think that it actually tries to facilitate search for words in human language, it makes sense - there are no such long words.

这就是为什么现在regexp查询".{7}a{7}.*"将匹配的原因:存在一个实际上以许多a开头的令牌,这是

That's why now regexp query ".{7}a{7}.*" will match: there is a token that actually starts with a lot of a's, which is an expected behavior of regexp query.

... Elasticsearch会将正则表达式应用于由 该字段的令牌生成器,而不是该字段的原始文本.

...Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.

如何使regexp查询考虑整个字符串?

这很简单:不要使用分析仪.类型 keyword 存储您提供的字符串照原样.

How can I make regexp query consider the entire string?

It is very simple: do not apply analyzers. The type keyword stores the string you provide as is.

具有这样的映射:

PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

您将能够执行类似这样的查询,以匹配帖子中的文档:

You will be able to do a query like this that will match the document from the post:

POST my_regexes/doc/_search
{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*"  }
        }
      ]
    }
  }
}

请注意,由于未分析该字段,因此查询区分大小写.

Note that the query became case-sensitive because the field is not analyzed.

regexp不再返回任何结果:".{12}a{7}.*"

This regexp won't return any results anymore: ".{12}a{7}.*"

这将是:".{12}A{7}.*"

正则表达式为锚定:

Lucene的模式始终是固定的.提供的模式必须与整个字符串匹配.

Lucene’s patterns are always anchored. The pattern provided must match the entire string.

看起来锚定错误的原因很可能是因为令牌在分析的text字段中被拆分了.

The reason why it looked like the anchoring was wrong was most likely because tokens got split in an analyzed text field.

希望有帮助!

这篇关于正则表达式开始于无法使用Elasticsearch 6. *的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆