Elasticsearch - query_string with wildcards

Question

I have some text in Elasticsearch containing URLs in various formats (http://www, www.). What I want to do is search for all texts containing, e.g., google.com.

For the current search I use something like this query:

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "cdate": {"gt": dfrom, "lte": dto}
                    }
                },
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": "true",
                        "query": searchString
                    }
                }
            ]
        }
    }
}

But a query like google.com never returns any results, while searching for, e.g., the term "test" works fine (without the quotes). I do want to use query_string because I'd like to use boolean operators, but I really need to be able to search for substrings, not only whole words.

Thanks!

Recommended answer

It is indeed true that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com, so a search for google.com will not match: no indexed token is exactly google.com.
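To make the mismatch concrete, here is a minimal Python sketch. It does not call Elasticsearch; the token list is copied by hand from the explanation above, and `term_matches` is a hypothetical helper that only imitates the "whole token or nothing" behaviour of term matching.

```python
# Tokens the standard analyzer is said to produce for
# "http://www.google.com" (copied by hand, not real analyzer output).
indexed_tokens = ["http", "www.google.com"]


def term_matches(term, tokens):
    """Hypothetical helper: a term matches only a whole indexed token."""
    return term in tokens


print(term_matches("www.google.com", indexed_tokens))  # True
print(term_matches("google.com", indexed_tokens))      # False
```

The second lookup fails because google.com is only a substring of an indexed token, never a token of its own.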

So the standard analyzer alone will not help here; we need a token filter that correctly transforms URL tokens. If your text field contained only URLs, another option would have been the UAX Email URL tokenizer, but since the field can contain any other text (e.g. user comments), it won't work.

Fortunately, there's a new plugin called analysis-url that provides a URL token filter, which is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) )

First, you need to install the plugin:

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip

Then, we can start playing. We need to create the proper analyzer for your text field:

curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'

With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer.

curl -XGET 'localhost:9200/test/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'

We'll get the following tokens:

{
  "tokens" : [ {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "www.google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 5
  } ]
}

As you can see, the http://www.google.com part will be tokenized into:


  • www.google.com
  • google.com i.e. what you expected
  • com

So now, if your searchString is google.com, you'll be able to find all the documents whose text field contains google.com (or www.google.com).
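As a rough sketch, the query from the question can be wrapped in a small Python function and reused unchanged against the re-indexed field. The function name `build_query` and the example dates are assumptions made for illustration; the field names (`cdate`, `text`) come from the question.

```python
import json


def build_query(search_string, dfrom, dto):
    """Build the bool query from the question: date range plus query_string."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"cdate": {"gt": dfrom, "lte": dto}}},
                    {
                        "query_string": {
                            "default_operator": "AND",
                            "default_field": "text",
                            "analyze_wildcard": True,
                            "query": search_string,
                        }
                    },
                ]
            }
        }
    }


# Hypothetical date bounds, chosen only for the example.
body = build_query("google.com", "2016-01-01", "2016-12-31")
print(json.dumps(body, indent=2))
```

The printed JSON is the request body to send to the index's _search endpoint with whatever client you already use.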
