Elasticsearch aggregate on URL hostname
Problem description
I am indexing documents with a field containing a URL:
[
    'myUrlField' => 'http://google.com/foo/bar'
]
Now what I'd like to get out of Elasticsearch is an aggregation on the URL field:
curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{
    "facets": {
        "groupByMyUrlField": {
            "terms": {
                "field": "myUrlField"
            }
        }
    }
}'
This is all well and good, but the default analyzer tokenizes the field so that each part of the URL becomes a token, so I get hits for http, google.com, foo and bar. But basically I am only interested in the hostname of the URL, google.com.
Can I use facets to group by a specific token?
"field": "myUrlField.0"
Or something like that?
Querying against the "not_analyzed" index is also no good, because I want to group by hostname and not by unique URLs.
I would love to be able to do this in Elasticsearch and not in my client code. Thanks.
Recommended answer
Here is a way to aggregate URLs by domain:
First you tokenize the full URL as a single token using a keyword tokenizer (which works the same as not_analyzed under the hood), then you extract the domain with a regex using a pattern_capture token filter. Finally, the original full-URL token is discarded via the preserve_original option.
Which results in the following index settings:
{
    "settings": {
        "analysis": {
            "filter": {
                "capture_domain_filter": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "flags": "CASE_INSENSITIVE",
                    "patterns": [
                        "https?:\/\/([^/]+)"
                    ]
                }
            },
            "analyzer": {
                "domain_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "capture_domain_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "weblink": {
            "properties": {
                "url": {
                    "type": "string",
                    "analyzer": "domain_analyzer"
                }
            }
        }
    }
}
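These settings must be applied at index creation time. The index name url_analyzer below is taken from the example commands that follow:

curl -XPUT 'http://localhost:9200/url_analyzer' -d '{ ... the settings and mappings shown above ... }'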
We check how our URLs are tokenized:
curl -sXGET http://localhost:9200/url_analyzer/_analyze\?analyzer\=domain_analyzer\&pretty -d 'http://en.wikipedia.org/wiki/Wikipedia' | grep token
"tokens" : [ {
"token" : "en.wikipedia.org",
This looks good. Now let's aggregate our URLs by domain using the aggregations feature (which will deprecate facets in the near future).
curl -XGET "http://localhost:9200/url_analyzer/_search?pretty" -d'
{
    "aggregations": {
        "tokens": {
            "terms": {
                "field": "url"
            }
        }
    }
}'
Output:
"aggregations" : {
    "tokens" : {
        "buckets" : [ {
            "key" : "en.wikipedia.org",
            "doc_count" : 2
        }, {
            "key" : "www.elasticsearch.org",
            "doc_count" : 1
        } ]
    }
}
From here you can go further and apply an additional shingle token filter on top of this to match queries such as "en.wikipedia" or "wikipedia.org", if you don't want to require an exact match when searching for a domain.
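One untested sketch of that idea (the analyzer and filter names here are my own): reuse the capture_domain_filter defined above, split the captured domain on dots with the built-in word_delimiter filter, then recombine the parts with a shingle filter:

"analysis": {
    "filter": {
        "domain_shingle_filter": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3,
            "token_separator": ".",
            "output_unigrams": true
        }
    },
    "analyzer": {
        "domain_parts_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
                "capture_domain_filter",
                "word_delimiter",
                "domain_shingle_filter"
            ]
        }
    }
}

With this, http://en.wikipedia.org/... should produce tokens such as en, wikipedia, org, en.wikipedia, wikipedia.org and en.wikipedia.org, so partial domain queries can match.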