ElasticSearch and Regex queries


Question

I am trying to query for documents that have dates within the body of the "content" field.

curl -XGET 'http://localhost:9200/index/_search' -d '{
    "query": {
        "regexp": {
            "content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\d\d)$" 
            }
        }
    }'

Maybe this is a little closer?

curl -XGET 'http://localhost:9200/index/_search' -d '{
        "filtered": {
        "query": {
            "match_all": {}
        },
        "filter": {
            "regexp":{
                "content" : "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\d\d)$"
                }
            }
        }
    }'

My regex seems to have been off. This regex has been validated on regex101.com. The following query still returns nothing from the 175k documents I have.

curl -XPOST 'http://localhost:9200/index/_search?pretty=true' -d '{
        "query": {
            "regexp":{
                "content" : "/[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{2}/[0-9]{2}/[0-9]{4}|[0-9]{4}/[0-9]{2}/[0-9]{2}/g"
            }
        }
    }'

I am starting to think that my index might not be set up for such a query. What type of field do you have to use to be able to use regular expressions?

mappings: {
    doc: {
        properties: {
            content: { type: string },
            title:   { type: string },
            host:    { type: string },
            cache:   { type: string },
            segment: { type: string },
            query: {
                properties: {
                    match_all: { type: object }
                }
            },
            digest:  { type: string },
            boost:   { type: string },
            tstamp:  { format: dateOptionalTime, type: date },
            url:     { type: string },
            fields:  { type: string },
            anchor:  { type: string }
        }
    }
}

I want to find any record that contains a date and graph the volume of documents by that date. Step 1 is to get this query working. Step 2 will be to pull the dates out and group the documents accordingly. Can someone suggest a way to get the first part working? I know the second part will be really tricky.

Thanks!

Recommended answer

You should read Elasticsearch's Regexp Query documentation carefully; you are making some incorrect assumptions about how the regexp query works.

Probably the most important thing to understand here is what string you are trying to match. You are matching terms, not the entire string. If this field is being indexed with StandardAnalyzer, as I suspect, your dates will be split into multiple terms:

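  • "01/01/1901" becomes the tokens "01", "01" and "1901"
  • "01 01 1901" becomes the tokens "01", "01" and "1901"
  • "01-01-1901" becomes the tokens "01", "01" and "1901"
  • "01.01.1901" will actually be a single token: "01.01.1901" (due to decimal handling, see UAX #29)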

You can only match a single, whole token with a regexp query.
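The whole-token rule can be illustrated without a cluster. The sketch below is only a simulation in Python: the token lists are copied by hand from the bullets above (they are not produced by a real analyzer), and `re.fullmatch` stands in for the "must match the entire token" behaviour. The names `TOKENS`, `DATE` and `any_token_matches` are made up for this example.

```python
import re

# Tokens per input string, taken from the list above (assumed, not analyzer output).
TOKENS = {
    "01/01/1901": ["01", "01", "1901"],
    "01 01 1901": ["01", "01", "1901"],
    "01-01-1901": ["01", "01", "1901"],
    "01.01.1901": ["01.01.1901"],
}

# The question's date pattern without anchors, with \d written as [0-9].
DATE = re.compile(
    r"(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)[0-9]{2})"
)


def any_token_matches(tokens):
    """Return True if at least one token matches the pattern in full."""
    return any(DATE.fullmatch(token) is not None for token in tokens)


results = {text: any_token_matches(tokens) for text, tokens in TOKENS.items()}
for text, hit in results.items():
    print(f"{text!r}: {hit}")
```

Only the dot-separated form survives as one token, so it is the only input a single full-date pattern could hit in this simulation.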

Elasticsearch (and Lucene) don't support full Perl-compatible regex syntax.

In your first couple of examples, you are using the anchors ^ and $. These are not supported. Your regex must match the entire token to get a match anyway, so anchors are not needed.

Shorthand character classes like \d (or \\d) are also not supported. Instead of \d\d, use [0-9]{2}.
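As a rough sketch of those two rewrites, the hypothetical helper below (`to_lucene_style` is not part of any Elasticsearch client) strips a leading ^ and trailing $ and expands \d to [0-9]. It handles only the two issues named here; it does not deal with an escaped backslash followed by d, and it does nothing about the tokenization problem described earlier.

```python
def to_lucene_style(pattern: str) -> str:
    """Drop ^/$ anchors at the ends and expand \\d to [0-9] (illustrative only)."""
    if pattern.startswith("^"):
        pattern = pattern[1:]
    # Keep a trailing "\$", which is an escaped literal dollar sign, not an anchor.
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    return pattern.replace("\\d", "[0-9]")


original = r"^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\d\d)$"
converted = to_lucene_style(original)
print(converted)
```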

In your last attempt, you are using /{regex}/g, which is also not supported. Since your regex needs to match the whole token, the global flag wouldn't even make sense in this context. Unless you are using a query parser that uses slashes to denote a regex, your regex should not be wrapped in them.

(By the way: how did this one validate on regex101? You have a bunch of unescaped /s. It complains at me when I try it.)

To support this sort of query on such an analyzed field, you'll probably want to look at span queries, particularly Span Multiterm and Span Near. Perhaps something like:

{
    "span_near" : {
        "clauses" : [
            { "span_multi" : { 
                "match": {
                    "regexp": {"content": "0[1-9]|[12][0-9]|3[01]"}
                }
            }},
            { "span_multi" : { 
                "match": {
                    "regexp": {"content": "0[1-9]|1[012]"}
                }
            }},
            { "span_multi" : { 
                "match": {
                    "regexp": {"content": "(19|20)[0-9]{2}"} 
                }
            }}
        ],
        "slop" : 0,
        "in_order" : true
    }
}
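For sending this to the _search endpoint, the span_near clause has to sit under a top-level "query" key in the request body. A minimal sketch that builds that body in Python follows; the helper names `term_regexp` and `date_span_query` are invented for this example, and the field name "content" comes from the question.

```python
import json


def term_regexp(field, pattern):
    """Wrap a per-token regexp in span_multi so it can be used inside span_near."""
    return {"span_multi": {"match": {"regexp": {field: pattern}}}}


def date_span_query(field="content"):
    """Build a search body matching day, month, year tokens in adjacent positions."""
    return {
        "query": {
            "span_near": {
                "clauses": [
                    term_regexp(field, "0[1-9]|[12][0-9]|3[01]"),  # day
                    term_regexp(field, "0[1-9]|1[012]"),           # month
                    term_regexp(field, "(19|20)[0-9]{2}"),         # year
                ],
                "slop": 0,         # no gaps allowed between the three tokens
                "in_order": True,  # day, then month, then year
            }
        }
    }


body = json.dumps(date_span_query(), indent=2)
print(body)
```

The printed JSON can be passed as the -d payload of a curl request like the ones in the question.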
