Elasticsearch - query_string with wildcards


Question


I have some text in elastic search containing urls in various formats (http://www, www.) what I want to do is to search for all texts containing e.g., google.com.


For the current search I use something like this query:

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "cdate": {
                            "gt": dfrom,
                            "lte": dto
                        }
                    }
                },
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": "true",
                        "query": searchString
                    }
                }
            ]
        }
    }
}


But a query looking like google.com never returns any results, while searching for e.g. the term test works fine. I do want to use query_string because I'd like to use boolean operators, but I really need to be able to search for substrings, not only whole words.

Thanks!

Answer


It is indeed true that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com, and thus google.com will not be found.
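To see why, here is a rough Python sketch of how the standard analyzer breaks up such a string. The regex is only an approximation for illustration, not Lucene's actual tokenizer:

```python
import re

def standard_like_tokens(text):
    # Rough approximation of the standard analyzer: lowercase the text and
    # keep runs of word characters joined by inner dots as single tokens.
    return re.findall(r"\w+(?:\.\w+)*", text.lower())

tokens = standard_like_tokens("http://www.google.com")
print(tokens)  # -> ['http', 'www.google.com']
```

Since google.com is never emitted as a token on its own, an exact-term search for it matches nothing.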


So the standard analyzer alone will not help here; we need a token filter that correctly transforms URL tokens. Another option, if your text field contained only URLs, would have been the UAX URL email tokenizer, but since the field can contain any other text (e.g. user comments), it won't work.


Fortunately, there's a new plugin around called analysis-url which provides a URL token filter, and this is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) ).


First, you need to install the plugin:

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip


Then, we can start playing. We need to create the proper analyzer for your text field:

curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'


With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer.

curl -XGET 'localhost:9200/test/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'

We'll get the following tokens:

{
  "tokens" : [ {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "www.google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 5
  } ]
}


As you can see the http://www.google.com part will be tokenized into:

  • www.google.com
  • google.com i.e. what you expected
  • com
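This suffix expansion can be sketched in a few lines of Python. Note that host_suffixes is a hypothetical helper that mimics what the filter does for the host part; it is not the plugin's actual code:

```python
def host_suffixes(host):
    # Emit the full host plus every dot-separated suffix, mimicking the
    # passthrough behavior of the analysis-url "host" token filter
    # (illustrative only).
    parts = host.split(".")
    return [".".join(parts[i:]) for i in range(len(parts))]

print(host_suffixes("www.google.com"))
# -> ['www.google.com', 'google.com', 'com']
```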


So now if your searchString is google.com, you'll be able to find all the documents whose text field contains google.com (or www.google.com).
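For instance, the original bool query from the question, with searchString set to google.com, would look like this (dfrom and dto are placeholder date bounds, not values from the question):

```python
# The question's bool query with searchString = "google.com".
# dfrom/dto are placeholder dates for the range clause.
dfrom, dto = "2015-01-01", "2015-12-31"
searchString = "google.com"

query = {
    "query": {
        "bool": {
            "must": [
                {"range": {"cdate": {"gt": dfrom, "lte": dto}}},
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": "true",
                        "query": searchString,
                    }
                },
            ]
        }
    }
}
print(query["query"]["bool"]["must"][1]["query_string"]["query"])  # -> google.com
```

No change to the query itself is needed; the url_host analyzer on the text field is what makes the google.com term match.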

