ElasticSearch and Regex queries
I am trying to query for documents that have dates within the body of the "content" field.
curl -XGET 'http://localhost:9200/index/_search' -d '{
"query": {
"regexp": {
"content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}'
Getting closer maybe?
curl -XGET 'http://localhost:9200/index/_search' -d '{
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp":{
"content" : "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}
}'
My regex seems to have been off. This regex has been validated on regex101.com. The following query still returns nothing from the 175k documents I have.
curl -XPOST 'http://localhost:9200/index/_search?pretty=true' -d '{
"query": {
"regexp":{
"content" : "/[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{2}/[0-9]{2}/[0-9]{4}|[0-9]{4}/[0-9]{2}/[0-9]{2}/g"
}
}
}'
I am starting to think that my index might not be set up for such a query. What type of field do you have to use to be able to use regular expressions?
"mappings": {
  "doc": {
    "properties": {
      "content": { "type": "string" },
      "title": { "type": "string" },
      "host": { "type": "string" },
      "cache": { "type": "string" },
      "segment": { "type": "string" },
      "query": {
        "properties": {
          "match_all": { "type": "object" }
        }
      },
      "digest": { "type": "string" },
      "boost": { "type": "string" },
      "tstamp": { "type": "date", "format": "dateOptionalTime" },
      "url": { "type": "string" },
      "fields": { "type": "string" },
      "anchor": { "type": "string" }
    }
  }
}
I want to find any record that contains a date and graph the volume of documents by that date. Step 1 is to get this query working. Step 2 will be to pull the dates out and group the documents by them. Can someone suggest a way to get the first part working? I know the second part will be really tricky.
Thanks!
You should read Elasticsearch's Regexp Query documentation carefully; you are making some incorrect assumptions about how the regexp query works.
Probably the most important thing to understand here is what you are matching against. A regexp query matches individual terms, not the entire field value. If the field is being indexed with StandardAnalyzer, as I would suspect, your dates will be split into multiple terms:
- "01/01/1901" becomes tokens "01", "01" and "1901"
- "01 01 1901" becomes tokens "01", "01" and "1901"
- "01-01-1901" becomes tokens "01", "01" and "1901"
- "01.01.1901" actually will be a single token: "01.01.1901" (Due to decimal handling, see UAX #29)
You can only match a single, whole token with a regexp query.
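To see why the full-date pattern finds nothing, here is a minimal Python sketch of the behaviour described above. The tokenizer is only a rough stand-in for the standard analyzer (the real one follows UAX #29), and the names `approximate_tokens` and `matching_tokens` are made up for this illustration. `re.fullmatch` imitates the rule that the pattern has to cover a whole token.

```python
import re

# Rough approximation of the standard analyzer, for illustration only:
# keep digit groups joined by dots as one token, otherwise split on
# anything that is not a letter or digit. The real rules are in UAX #29.
TOKEN_RE = re.compile(r"[0-9]+(?:\.[0-9]+)+|[A-Za-z0-9]+")


def approximate_tokens(text):
    """Return lowercased tokens, roughly as the analyzer would emit them."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


# The date pattern from the question, without anchors and without \d.
DATE_RE = r"(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}"


def matching_tokens(text, pattern=DATE_RE):
    """Return the tokens that the pattern matches in full."""
    return [t for t in approximate_tokens(text) if re.fullmatch(pattern, t)]


if __name__ == "__main__":
    for sample in ("01/01/1901", "01 01 1901", "01-01-1901", "01.01.1901"):
        print(sample, approximate_tokens(sample), matching_tokens(sample))
```

Under these assumptions, only the dotted form survives as a single token, so it is the only one the full-date pattern can match.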
Elasticsearch (and Lucene) don't support full Perl-compatible regex syntax.
In your first couple of examples, you are using anchors, ^ and $. These are not supported. Your regex must match the entire token to get a match anyway, so anchors are not needed.
Shorthand character classes like \d (or \\d) are also not supported. Instead of \\d\\d, use [0-9]{2}.
In your last attempt, you are using /{regex}/g, which is also not supported. Since your regex needs to match the whole token, the global flag wouldn't even make sense in this context. Unless you are using a query parser that uses slashes to denote a regex, your regex should not be wrapped in them.
(By the way: How did this one validate on regex101? You have a bunch of unescaped /s. It complains at me when I try it.)
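The three adjustments above (no anchors, no \d, no /.../g wrapper) can be applied mechanically. Below is a small Python sketch; `to_lucene_regexp` is a hypothetical helper written for this answer, not part of any Elasticsearch client, and it only handles the cases discussed here. In particular, it does not try to tell an escaped backslash followed by d apart from a real \d.

```python
import re


def to_lucene_regexp(pattern):
    """Best-effort cleanup of a PCRE-style pattern for a regexp query.

    Hypothetical helper: covers only the three issues discussed above.
    """
    # Drop a /.../flags wrapper such as /.../g.
    wrapped = re.fullmatch(r"/(.*)/[a-z]*", pattern, flags=re.DOTALL)
    if wrapped:
        pattern = wrapped.group(1)
    # Drop a leading ^ and a trailing unescaped $; the match is already
    # implicitly anchored to the whole token.
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    # Expand the \d shorthand into an explicit character class.
    return pattern.replace("\\d", "[0-9]")


if __name__ == "__main__":
    print(to_lucene_regexp(r"^(19|20)\d\d$"))
    print(to_lucene_regexp("/[0-9]{4}-[0-9]{2}-[0-9]{2}/g"))
```

Keep in mind that a cleaned-up pattern still has to match a single token, so this alone will not make the original full-date regex work on an analyzed field.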
To support this sort of query on such an analyzed field, you'll probably want to look to span queries, and particularly Span Multiterm and Span Near. Perhaps something like:
{
  "span_near" : {
    "clauses" : [
      { "span_multi" : {
          "match" : {
            "regexp" : { "content" : "0[1-9]|[12][0-9]|3[01]" }
          }
      } },
      { "span_multi" : {
          "match" : {
            "regexp" : { "content" : "0[1-9]|1[012]" }
          }
      } },
      { "span_multi" : {
          "match" : {
            "regexp" : { "content" : "(19|20)[0-9]{2}" }
          }
      } }
    ],
    "slop" : 0,
    "in_order" : true
  }
}
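If you would rather build the request body in code than hand-edit nested braces, something like the following Python sketch may help. It only constructs the JSON; the function names and the default field name `content` are assumptions taken from the question, and nothing here talks to a cluster. Note that `span_multi` takes its wrapped multi-term query under a `match` key.

```python
import json


def span_regexp_clause(field, pattern):
    """Wrap a regexp query in span_multi so span_near can use it as a clause."""
    return {"span_multi": {"match": {"regexp": {field: pattern}}}}


def build_date_span_query(field="content"):
    """Build a search body matching day, month and year as adjacent tokens."""
    day = "0[1-9]|[12][0-9]|3[01]"
    month = "0[1-9]|1[012]"
    year = "(19|20)[0-9]{2}"
    return {
        "query": {
            "span_near": {
                "clauses": [
                    span_regexp_clause(field, p) for p in (day, month, year)
                ],
                "slop": 0,          # tokens must be directly adjacent
                "in_order": True,   # day, then month, then year
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted after -d in a curl call.
    print(json.dumps(build_date_span_query(), indent=2))
```

The printed JSON can be sent to the _search endpoint the same way as the queries in the question. It covers the day-month-year order only; other orders would need their own span_near clause.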