Elasticsearch aggregate on URL hostname


Question

I am indexing documents with a field containing a URL:

[
    'myUrlField' => 'http://google.com/foo/bar'
]
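
For reference, indexing such a document might look like this (a minimal sketch; the index name myIndex matches the query below, while the type name doc and the document ID are placeholders):

curl -XPUT 'http://localhost:9200/myIndex/doc/1' -d '{
  "myUrlField": "http://google.com/foo/bar"
}'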

Now what I'd like to get out of Elasticsearch is an aggregation on the URL field.

curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{
  "facets": {
    "groupByMyUrlField": {
      "terms": {
        "field": "myUrlField"
      }
    }
  }
}'

This is all well and good, but the default analyzer tokenizes the field so that each part of the URL becomes a token, so I get hits for http, google.com, foo and bar. But basically I am only interested in the hostname of the URL: google.com.
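
You can confirm this tokenization with the _analyze API (a quick check against the standard analyzer; output abbreviated to the token lines):

curl -sXGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'http://google.com/foo/bar' | grep '"token"'
    "token" : "http",
    "token" : "google.com",
    "token" : "foo",
    "token" : "bar",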

Can I use facets to group by a specific token?

"field": "myUrlField.0"

or something similar?

Querying a "not_analyzed" version of the field is also no good, because I want to group by hostname, not by unique URLs.
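
To illustrate: with a not_analyzed field, the terms buckets are keyed by full URLs, so two pages on the same host land in different buckets (hypothetical output):

"buckets" : [ {
  "key" : "http://google.com/foo/bar",
  "doc_count" : 1
}, {
  "key" : "http://google.com/baz",
  "doc_count" : 1
} ]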

Would love to be able to do this in Elasticsearch and not in my client code. Thanks!

Answer

Here is a way to aggregate URLs by domain:

First you tokenize the full URL as a single token using a keyword tokenizer (which works the same as not_analyzed under the hood), then you extract the domain with a regex using a pattern capture token filter. Finally, we discard the original full-URL token thanks to the preserve_original option (set to false here).

Which results in:

{
  "settings": {
    "analysis": {
      "filter": {
        "capture_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "flags": "CASE_INSENSITIVE",
          "patterns": [
            "https?:\/\/([^/]+)"
          ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_domain_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "weblink": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "domain_analyzer"
        }
      }
    }
  }
}
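
To try this out, save the JSON above as settings.json and create the index with it (url_analyzer is the index name used in the calls below):

curl -XPUT 'http://localhost:9200/url_analyzer' -d @settings.json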

We check how our URLs are tokenized:

curl -sXGET http://localhost:9200/url_analyzer/_analyze\?analyzer\=domain_analyzer\&pretty -d 'http://en.wikipedia.org/wiki/Wikipedia' | grep token
  "tokens" : [ {
    "token" : "en.wikipedia.org",

This looks good. Now let's aggregate our URLs by domain using the aggregations feature (which will deprecate facets in the near future):

curl -XGET "http://localhost:9200/url_analyzer/_search?pretty" -d'
{
  "aggregations": {
    "tokens": {
      "terms": {
        "field": "url"
      }
    }
  }
}'

Output:

"aggregations" : {
    "tokens" : {
      "buckets" : [ {
        "key" : "en.wikipedia.org",
        "doc_count" : 2
      }, {
        "key" : "www.elasticsearch.org",
        "doc_count" : 1
      } ]
    }
}

From here you can go further and apply an additional shingle token filter on top of this, to match queries such as "en.wikipedia" or "wikipedia.org" if you don't want to be limited to exact matches when searching for a domain.
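
One possible way to wire that up (a sketch, not the answer's exact configuration: a word_delimiter filter splits the captured domain on the dots, and a shingle filter with a "." separator recombines adjacent labels; the index, analyzer and filter names here are made up):

# Sketch: turns "en.wikipedia.org" into the searchable tokens
# "en.wikipedia", "wikipedia.org" and "en.wikipedia.org".
curl -XPUT 'http://localhost:9200/url_shingle_analyzer' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "capture_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": [ "https?:\/\/([^/]+)" ]
        },
        "domain_parts_filter": {
          "type": "word_delimiter"
        },
        "domain_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": false,
          "token_separator": "."
        }
      },
      "analyzer": {
        "domain_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_domain_filter",
            "domain_parts_filter",
            "domain_shingle_filter"
          ]
        }
      }
    }
  }
}'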

