ElasticSearch Regexp Filter
Problem description
I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match anything under "info-for/media" in the url field, e.g. http://mydomain.co.uk/info-for/media/press-release-1. To get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.
POST to localhost:9200/_search
{
  "query" : {
    "match_all" : { },
    "filtered" : {
      "filter" : {
        "regexp": {
          "url":".*info-for/media.*"
        }
      }
    }
  },
}
This returns 0 hits, but does parse correctly. The pattern .*info.* does return results containing the url, but unfortunately it is too broad, e.g. it matches any url containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception or no matches. Can anybody help explain what I'm doing wrong?
Recommended answer
First, to the extent possible, avoid regular expressions or wildcards that don't have a prefix. A search for .*foo.* works by matching every single term in the index's dictionary against the pattern; the matching terms are then combined into an OR-query. This is O(n) in the number of unique terms in your corpus, and the subsequent search is quite expensive as well.
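To make the cost difference concrete, here is a toy model in Python. It is only a sketch: the sorted list stands in for the index's term dictionary, the function names regexp_scan and prefix_scan are made up for illustration, and Lucene's real implementation works differently internally. The point is just how many terms have to be looked at in each case.

```python
import re
from bisect import bisect_left


def regexp_scan(terms, pattern):
    """No usable prefix: every term in the dictionary is tested against the pattern."""
    compiled = re.compile(pattern)
    checked = 0
    hits = []
    for term in terms:
        checked += 1
        # The regexp has to match the whole term, not just a part of it.
        if compiled.fullmatch(term):
            hits.append(term)
    return hits, checked


def prefix_scan(sorted_terms, prefix):
    """Known prefix: jump to the first candidate and stop once the prefix no longer matches."""
    start = bisect_left(sorted_terms, prefix)
    checked = 0
    hits = []
    for term in sorted_terms[start:]:
        checked += 1
        if not term.startswith(prefix):
            break
        hits.append(term)
    return hits, checked


# Hypothetical term dictionary, kept sorted like a real one would be.
terms = sorted(["bar", "foo", "foobar", "football", "info",
                "information", "media", "zebra"])

regexp_hits, regexp_checked = regexp_scan(terms, ".*foo.*")
prefix_hits, prefix_checked = prefix_scan(terms, "foo")
```

Both calls find the same three terms here, but the regexp version inspects all 8 terms while the prefix version inspects 4. With millions of unique terms that gap is what makes unanchored patterns expensive.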
This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media term in the dictionary for the regexp to match.
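The following Python sketch illustrates why the pattern finds nothing. The rough_tokenize function is an assumption made for illustration: it simply splits on anything that is not a letter or digit, which is cruder than the analyzer Elasticsearch actually applies, but it splits "info-for" the same way. The regexp is then tested against each term on its own, which is how the regexp filter sees the data.

```python
import re


def rough_tokenize(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_terms(terms, pattern):
    """Return the terms that the pattern matches in full, one term at a time."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


url = "http://mydomain.co.uk/info-for/media/press-release-1"
terms = rough_tokenize(url)

# Matches, because "info" exists as a term of its own.
broad = matching_terms(terms, ".*info.*")

# Matches nothing: no single term contains "info-for/media".
narrow = matching_terms(terms, ".*info-for/media.*")
```

Here broad is ['info'] and narrow is an empty list, which mirrors the behaviour described in the question: the pattern is fine, but it is never run against the full url string.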
What you probably want to do is to index the path and the domain separately, using a path_hierarchy tokenizer to generate the terms.
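As a rough idea of what the analysis settings for this could look like, here is a sketch built as a Python dict and serialized to JSON. The names path_analyzer, domain_analyzer and domain_tokenizer are made up for illustration. The path_hierarchy tokenizer and its delimiter and reverse parameters are real, but the mapping that assigns these analyzers to fields depends on your Elasticsearch version and is left out here.

```python
import json

# Analysis section of the index settings (analyzer names are hypothetical).
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Domains are split on "." and built up from the right-hand side.
                "domain_tokenizer": {
                    "type": "path_hierarchy",
                    "delimiter": ".",
                    "reverse": True,
                }
            },
            "analyzer": {
                # Paths use the tokenizer's defaults (delimiter "/").
                "path_analyzer": {
                    "type": "custom",
                    "tokenizer": "path_hierarchy",
                },
                "domain_analyzer": {
                    "type": "custom",
                    "tokenizer": "domain_tokenizer",
                },
            },
        }
    }
}

# Body to send when creating the index.
settings_json = json.dumps(index_settings, indent=2)
```

Splitting the url into a path and a domain before indexing is something your application would have to do; these settings only control how each part is turned into terms.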
Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis
I.e. /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar, /foo and the domain foo.example.com is tokenized to foo.example.com, example.com, com.
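If you want to play with the idea outside Elasticsearch, the token generation can be imitated in a few lines of Python. This is a simplified imitation, not the real tokenizer: the function name and signature are made up, and edge cases such as trailing delimiters are ignored.

```python
def path_hierarchy(text, delimiter="/", reverse=False):
    """Imitate path_hierarchy: emit one token per level of the hierarchy."""
    parts = text.split(delimiter)
    if reverse:
        # Keep the right-hand side and drop leading parts one at a time.
        return [delimiter.join(parts[i:]) for i in range(len(parts))]
    tokens = []
    for end in range(1, len(parts) + 1):
        token = delimiter.join(parts[:end])
        if token:  # skip the empty token produced by a leading delimiter
            tokens.append(token)
    return tokens


path_tokens = path_hierarchy("/foo/bar/baz")
domain_tokens = path_hierarchy("foo.example.com", delimiter=".", reverse=True)
```

path_tokens is ['/foo', '/foo/bar', '/foo/bar/baz'] and domain_tokens is ['foo.example.com', 'example.com', 'com']. The order differs from the listing above, but the set of terms is the same, and every ancestor path ends up as a term that can be matched exactly.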
A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
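A request body for that could be sketched as follows, again as a Python dict. It assumes the path was indexed into a field called path with a path_hierarchy tokenizer, and it uses the filtered query form from the question. Later Elasticsearch versions removed the filtered query in favour of a bool query with a filter clause, so adapt the wrapper to your version. The helper name build_path_filter is made up.

```python
import json


def build_path_filter(path):
    """Build a search body that keeps only documents at or below the given path."""
    return {
        "query": {
            "filtered": {
                # Placeholder query; swap in the user's query later.
                "query": {"match_all": {}},
                # Exact term match against the tokens produced at index time.
                "filter": {"term": {"path": path}},
            }
        }
    }


body = build_path_filter("/info-for/media")
body_json = json.dumps(body)
```

Because the term filter is not analyzed, the value has to equal one of the indexed tokens exactly, which is precisely what the path_hierarchy tokens provide.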