ElasticSearch Regexp Filter

Problem description

I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.

POST to localhost:9200/_search

{
  "query": {
    "match_all": {},
    "filtered": {
      "filter": {
        "regexp": {
          "url": ".*info-for/media.*"
        }
      }
    }
  }
}

This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?

Recommended answer

First, as far as possible, avoid regular expressions or wildcards that don't have a prefix. A search for .*foo.* works by matching every single term in the index's dictionary against the pattern; the matching terms are then combined into an OR-query. This is O(n) in the number of unique terms in your corpus, and the subsequent search is quite expensive as well.
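
To make the cost concrete, here is a toy model of that expansion step in Python. It is only a sketch: the term list is made up, and the real term dictionary is not scanned with Python's re module. The point is just that, without a fixed prefix, every unique term has to be looked at before the OR-query can be built.

```python
import re

def expand_regexp(pattern, term_dictionary):
    """Toy model: test the pattern against every unique term.

    Returns the matching terms (which would be OR-ed together)
    and the number of terms that had to be examined.
    """
    compiled = re.compile(pattern)
    matches = []
    checked = 0
    for term in term_dictionary:
        checked += 1
        # The pattern has to match the whole term, not a substring of it.
        if compiled.fullmatch(term):
            matches.append(term)
    return matches, checked

# Hypothetical term dictionary, purely for illustration.
terms = ["foo", "foobar", "barfoo", "baz", "information", "info"]

matches, checked = expand_regexp(".*foo.*", terms)
print(matches)   # terms that would be OR-ed together
print(checked)   # one check per unique term
```

With a pattern that starts with a literal prefix, an index can jump straight to the terms sharing that prefix instead of visiting all of them, which is why the leading .* is the expensive part.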

This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/

Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no single info-for/media term in the dictionary for the regexp to match.
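
A rough Python sketch of why this gives 0 hits. The tokenizer below is a crude stand-in (an assumption, not the analyzer your index actually uses): it simply splits on anything that is not a letter or digit. The regexp is then tried against each term on its own, and it has to match the whole term.

```python
import re

def crude_tokenize(text):
    """Assumed stand-in for the default analysis: split on
    non-alphanumeric characters and lowercase the pieces."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def any_term_matches(pattern, terms):
    """The pattern is tested per term and must match the entire term."""
    return any(re.fullmatch(pattern, term) for term in terms)

url = "http://mydomain.co.uk/info-for/media/press-release-1"
terms = crude_tokenize(url)
print(terms)

# No single term contains "info-for/media", so nothing matches.
print(any_term_matches(".*info-for/media.*", terms))

# ".*info.*" matches the lone term "info" (and would match "information").
print(any_term_matches(".*info.*", terms))
```

Under this assumed tokenization the hyphen and the slash are both split points, which would explain why adding "-for" to the pattern makes the hits disappear.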

What you probably want to do is to index the path and the domain separately, with a path_hierarchy-tokenizer to generate the terms.
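
A possible shape for this, sketched in Python. The names path_tokenizer, domain_tokenizer, path_analyzer and domain_analyzer, and the idea of splitting the URL client-side before indexing, are assumptions made for illustration; only the path_hierarchy tokenizer type and its delimiter and reverse parameters are standard. The field mapping is left out because its syntax depends on your ElasticSearch version.

```python
import json
from urllib.parse import urlsplit

# Analysis settings to send when creating the index (names are hypothetical).
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # /a/b/c -> /a, /a/b, /a/b/c
                "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
                # a.b.c -> a.b.c, b.c, c
                "domain_tokenizer": {
                    "type": "path_hierarchy",
                    "delimiter": ".",
                    "reverse": True,
                },
            },
            "analyzer": {
                "path_analyzer": {"type": "custom", "tokenizer": "path_tokenizer"},
                "domain_analyzer": {"type": "custom", "tokenizer": "domain_tokenizer"},
            },
        }
    }
}

def split_url(url):
    """Split a URL into the two values to index as separate fields."""
    parts = urlsplit(url)
    return {"domain": parts.hostname, "path": parts.path}

print(json.dumps(index_settings, indent=2))
print(split_url("http://mydomain.co.uk/info-for/media/press-release-1"))
```

Each document would then carry a domain field and a path field in addition to (or instead of) the raw url, with the two analyzers assigned to them in the mapping.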

Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis

That is, /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar, /foo, and the domain foo.example.com is tokenized to foo.example.com, example.com, com.
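
If you want to reason about those tokens without a running cluster, the behaviour can be imitated in a few lines of Python. These two helpers are hypothetical and only mimic the output described above; they are not how the tokenizer is implemented.

```python
def path_tokens(path, delimiter="/"):
    """Imitate path_hierarchy: one token per ancestor, shortest first."""
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def domain_tokens(domain, delimiter="."):
    """Imitate the reversed variant: strip leading labels one at a time."""
    parts = domain.split(delimiter)
    return [delimiter.join(parts[i:]) for i in range(len(parts))]

print(path_tokens("/foo/bar/baz"))       # ['/foo', '/foo/bar', '/foo/bar/baz']
print(domain_tokens("foo.example.com"))  # ['foo.example.com', 'example.com', 'com']
```

The key property is that every ancestor path is itself a term, so "is this document under /foo/bar?" becomes an exact lookup of the term /foo/bar.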

A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
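
Applied to the question, the request body might look like the following sketch, built in Python so it can be printed as JSON. It keeps the filtered-query style used in the question and assumes a field named path that was indexed with a path_hierarchy tokenizer; below_path_query is a hypothetical helper. On later ElasticSearch versions, where the filtered query no longer exists, the same term clause would go in the filter section of a bool query.

```python
import json

def below_path_query(path_prefix):
    """Build a search body that keeps only documents under path_prefix."""
    return {
        "query": {
            "filtered": {
                # Placeholder; swap in the user's match_phrase query later.
                "query": {"match_all": {}},
                # Exact term lookup, no regexp expansion needed.
                "filter": {"term": {"path": path_prefix}},
            }
        }
    }

body = below_path_query("/info-for/media")
print(json.dumps(body, indent=2))
```

The printed JSON can be POSTed to the same _search endpoint as the original regexp request.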
