ElasticSearch Regexp Filter

Views: 129 Published: 2017/8/7 1:10:04 Tags: regex elasticsearch

Question

I'm having trouble expressing a regular expression correctly for the ElasticSearch Regexp Filter. I'm trying to match anything under "info-for/media" in the url field, e.g. http://mydomain.co.uk/info-for/media/press-release-1. To get the regex right I'm using match_all for now, but this will eventually be a match_phrase with the user's query string.

POST to localhost:9200/_search

{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "regexp": {
                    "url": ".*info-for/media.*"
                }
            }
        }
    }
}

This returns 0 hits, but does parse correctly. .*info.* does return results containing the url, but it is unfortunately too broad, e.g. it matches any url containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I get either a parse exception or no matches. Can anybody help explain what I'm doing wrong?

Recommended answer

First, to the extent possible, avoid regular expressions or wildcards that don't have a prefix. A search for .*foo.* works by matching every single term in the index's dictionary against the pattern; the matching terms are then combined into an OR-query. This is O(n) in the number of unique terms in your corpus, and the subsequent search is quite expensive as well.
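
To make that cost concrete, here is a rough Python sketch of the idea. It is not how Elasticsearch is implemented (Lucene compiles the pattern into an automaton rather than using Python's `re`), and the term list and function names are invented for illustration. A pattern with a leading wildcard has to be tested against every term, while a pattern with a fixed prefix only needs the slice of the sorted dictionary that starts with that prefix.

```python
import bisect
import re

# Hypothetical sorted term dictionary; a real index may hold millions of terms.
TERMS = sorted(["for", "info", "information", "media", "press", "release"])

def expand_unprefixed(pattern):
    """Test the pattern against every term: O(n) in the number of unique terms."""
    compiled = re.compile(pattern)
    checked = 0
    matches = []
    for term in TERMS:
        checked += 1
        if compiled.fullmatch(term):
            matches.append(term)
    return matches, checked

def expand_prefixed(prefix, rest_pattern):
    """Binary-search to the prefix, then test only the terms that share it."""
    compiled = re.compile(re.escape(prefix) + rest_pattern)
    start = bisect.bisect_left(TERMS, prefix)
    checked = 0
    matches = []
    for term in TERMS[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        checked += 1
        if compiled.fullmatch(term):
            matches.append(term)
    return matches, checked

# Each call returns (matching terms, number of terms examined).
print(expand_unprefixed(".*info.*"))
print(expand_prefixed("info", ".*"))
```

Both calls find the same terms, but the prefixed version examines only the terms that start with "info". The matching terms are what would then be ORed together in the actual search.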

This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/

Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no single term containing "info-for/media" in the dictionary for the regexp to match.
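
A small Python sketch of why the pattern finds nothing. It assumes the url field is analyzed with something like the standard analyzer, and approximates that by lowercasing and splitting on non-alphanumeric characters (the real tokenizer follows Unicode segmentation rules and may, for example, keep the host name together). It also relies on the regexp having to match a whole term, which is how Lucene's regexp syntax behaves. The function names are made up.

```python
import re

def rough_tokens(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def matching_terms(pattern, terms):
    """Return the terms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

url = "http://mydomain.co.uk/info-for/media/press-release-1"
terms = rough_tokens(url)
print(terms)
print(matching_terms(".*info-for/media.*", terms))  # no term holds "-" or "/"
print(matching_terms(".*info.*", terms))
```

No individual term contains the hyphen or the slash, so the first pattern has nothing to match, while .*info.* matches the term "info" (and would match "information" in other documents).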

What you probably want to do is index the path and the domain separately, using a path_hierarchy tokenizer to generate the terms.
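
As a sketch of what that could look like, the Python snippet below builds an index settings body with two custom analyzers. The names `path_analyzer`, `domain_analyzer` and `domain_tokenizer` are placeholders, and splitting the URL into a path and a domain before indexing is assumed to happen in the application. `delimiter` and `reverse` are options of the path_hierarchy tokenizer; the field mapping that assigns these analyzers is left out because its syntax depends on the Elasticsearch version.

```python
import json

def build_index_settings():
    """Analysis settings with path_hierarchy tokenizers for paths and domains."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Domains are split on "." and expanded from the right,
                    # so foo.example.com also yields example.com and com.
                    "domain_tokenizer": {
                        "type": "path_hierarchy",
                        "delimiter": ".",
                        "reverse": True,
                    }
                },
                "analyzer": {
                    # The built-in path_hierarchy tokenizer splits on "/".
                    "path_analyzer": {
                        "type": "custom",
                        "tokenizer": "path_hierarchy",
                    },
                    "domain_analyzer": {
                        "type": "custom",
                        "tokenizer": "domain_tokenizer",
                    },
                },
            }
        }
    }

# Print the JSON body that would be sent when creating the index.
print(json.dumps(build_index_settings(), indent=2))
```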

Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis

I.e. /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar and /foo, and the domain foo.example.com is tokenized to foo.example.com, example.com and com.
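
The token lists above can be reproduced with a few lines of Python. This is only an emulation for illustration (the function name and signature are made up, and edge cases such as trailing delimiters are not handled the way the real tokenizer handles them), but it shows the two directions of expansion.

```python
def path_hierarchy(text, delimiter="/", reverse=False):
    """Emulate the token output of a path_hierarchy-style tokenizer."""
    parts = text.split(delimiter)
    if reverse:
        # Drop leading parts one at a time: a.b.c -> a.b.c, b.c, c
        return [delimiter.join(parts[i:]) for i in range(len(parts))]
    # Grow the prefix one part at a time: /a/b -> /a, /a/b
    tokens = []
    for i in range(1, len(parts) + 1):
        token = delimiter.join(parts[:i])
        if token:  # skip the empty string before a leading delimiter
            tokens.append(token)
    return tokens

print(path_hierarchy("/foo/bar/baz"))
print(path_hierarchy("foo.example.com", delimiter=".", reverse=True))
```

Because every ancestor path is stored as its own term, "is this document below /foo/bar?" becomes an exact term lookup instead of a pattern match.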

A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That is a far more performant filter, and it can also be cached.
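
Applied to the URL from the question, this might look like the following Python sketch. The field names `path` and `domain`, and both helper functions, are assumptions; the request body uses the same `filtered` query form as the question, which newer Elasticsearch versions express with a `bool` query and its `filter` clause instead.

```python
import json
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the two values to index as separate fields."""
    parts = urlsplit(url)
    return {"domain": parts.hostname, "path": parts.path}

def below_path_query(path_prefix):
    """Build a search body that keeps only documents below path_prefix."""
    return {
        "query": {
            "filtered": {
                # match_all is a placeholder for the user's real query.
                "query": {"match_all": {}},
                # Exact term lookup against the path_hierarchy tokens.
                "filter": {"term": {"path": path_prefix}},
            }
        }
    }

doc = split_url("http://mydomain.co.uk/info-for/media/press-release-1")
print(doc)
print(json.dumps(below_path_query("/info-for/media"), indent=2))
```

The term filter value is not analyzed, so it has to equal one of the indexed tokens exactly, including the leading slash.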
