正则表达式从 Elasticsearch 6.* 不工作开始 [英] Regexp starts with not working Elasticsearch 6.*

查看:17
本文介绍了正则表达式从 Elasticsearch 6.* 不工作开始的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在理解 ElasticSearch 中的正则表达式机制时遇到了麻烦.我有代表财产单位的文件:

I got trouble with understanding regexp mechanizm in ElasticSearch. I have documents that represent property units:

{
    "Unit" :
    {
         "DailyAvailablity" : 
         "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    }
}

DailyAvailability 字段代码从今天起未来两年内按天数显示的物业可用性.'A' 表示有空,'U' 表示不可用,'I' 可以签入,'O' 可以签出.如何编写正则表达式过滤器以获取特定日期可用的所有单位?

DailyAvailability field codes availability of property by days for the next two years from today. 'A' means available, 'U' unabailable, 'I' can check in, 'O' can check out. How can I write regexp filter to get all units that are available in particular dates?

我试图在 DailyAvailability 字段中找到具有特定长度和偏移量的A"子字符串.例如,要查找从今天起 7 天内可用 7 天的单位:

I tried to find the 'A' substring with particular length and offset in DailyAvailability field. For example to find units that would be available for 7 days in 7 days from today:

{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
        }
      ]
    }
  }
}

此查询返回例如具有从UUUUUUUUUUUUUUUUUUUIAA"开始的 DateAvailability 的单元,但在字段内的某处包含合适的序列.如何为整个源字符串锚定正则表达式?ES 文档说 lucene 正则表达式应该是默认锚定的.

This query returns for instance unit with DateAvailability that starts from "UUUUUUUUUUUUUUUUUUUIAA", but contains suitable sequences somehere inside the field. How can I anchor regexp for entire source string? ES docs say that lucene regex should be anchored by default.

附:我试过 '^.{7}a{7}.*$'.返回空集.

P.S. I have tried '^.{7}a{7}.*$'. Returns empty set.

推荐答案

看起来你正在使用 text 数据类型来存储 Unit.DailyAvailability (如果您使用 动态映射).您应该考虑使用 keyword 数据类型.

It looks like you are using text datatype to store Unit.DailyAvailability (which is also the default one for strings if you are using dynamic mapping). You should consider using keyword datatype instead.

让我更详细地解释一下.

Let me explain in a bit more detail.

text 数据类型发生的情况是数据被分析以进行全文搜索.它会进行一些转换,例如小写和拆分为标记.

What happens with text datatype is that the data gets analyzed for full-text search. It does some transformations like lowercasing and splitting into tokens.

让我们尝试使用 Analyze API 反对您的意见:

Let's try to use the Analyze API against your input:

POST _analyze
{
  "text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}

回复是:

{
  "tokens": [
    {
      "token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 255,
      "end_offset": 510,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 510,
      "end_offset": 732,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

如您所见,Elasticsearch 已将您的输入分成三个标记并将它们小写.这看起来出乎意料,但如果您认为它实际上是在尝试促进人类语言中的单词搜索,那是有道理的 - 没有这么长的单词.

As you can see, Elasticsearch has split your input into three tokens and lowercased them. This looks unexpected, but if you think that it actually tries to facilitate search for words in human language, it makes sense - there are no such long words.

这就是为什么现在 regexp 查询 ".{7}a{7}.*" 会匹配:有一个令牌实际上以很多a,这是一个 regexp 查询的预期行为.

That's why now regexp query ".{7}a{7}.*" will match: there is a token that actually starts with a lot of a's, which is an expected behavior of regexp query.

...Elasticsearch 会将正则表达式应用于由该字段的标记器,而不是该字段的原始文本.

...Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.

如何让 regexp 查询考虑整个字符串?

这很简单:不要应用分析器.类型 keyword按原样存储您提供的字符串.

How can I make regexp query consider the entire string?

It is very simple: do not apply analyzers. The type keyword stores the string you provide as is.

使用这样的映射:

PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

您将能够执行这样的查询,以匹配帖子中的文档:

You will be able to do a query like this that will match the document from the post:

POST my_regexes/doc/_search
{
 "query": {
   "bool": {
     "filter": [
        {
         "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*"  }
        }
      ]
    }
  }
}

请注意,由于未分析该字段,因此查询变得区分大小写.

Note that the query became case-sensitive because the field is not analyzed.

这个 regexp 将不再返回任何结果:".{12}a{7}.*"

This regexp won't return any results anymore: ".{12}a{7}.*"

这将:".{12}A{7}.*"

正则表达式是 锚定:

Lucene 的模式总是被锚定的.提供的模式必须匹配整个字符串.

Lucene’s patterns are always anchored. The pattern provided must match the entire string.

看起来锚定错误的原因很可能是因为标记在分析的 text 字段中被拆分.

The reason why it looked like the anchoring was wrong was most likely because tokens got split in an analyzed text field.

这篇关于正则表达式从 Elasticsearch 6.* 不工作开始的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆