elasticsearch group-by multiple fields
Question
I am looking for the best way to group data in Elasticsearch. Elasticsearch doesn't support anything like SQL's GROUP BY.
Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link target, SEO titles, ...) and custom sorting for the categories.
Using aggregations. Example: https://found.no/play/gist/8124563. This looks usable if you have to group by one field and need some extra fields.
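As a minimal sketch of that first approach: the query below groups by one field and pulls a few extra fields from one representative document per bucket, using a `terms` aggregation with a `top_hits` sub-aggregation (available from Elasticsearch 1.3). The field names `category`, `icon`, `link_target`, and `seo_title` are hypothetical, and the query is built as a plain Python dict so you can pass it to any client:

```python
# Sketch (assumption): group by one field and fetch extra metadata per group
# via a terms aggregation with a top_hits sub-aggregation.
# Field names here are hypothetical placeholders.

def build_group_query(group_field, extra_fields, size=100):
    """Build a request body that groups by `group_field` and returns
    `extra_fields` from one representative document per bucket."""
    return {
        'size': 0,  # we only want the aggregation, not the search hits
        'aggs': {
            'groups': {
                'terms': {'field': group_field, 'size': size},
                'aggs': {
                    'sample': {
                        'top_hits': {
                            'size': 1,
                            '_source': {'include': extra_fields},
                        }
                    }
                },
            }
        },
    }

query = build_group_query('category', ['icon', 'link_target', 'seo_title'])
```

You would then send `query` as the body of a search request and read the extra fields out of each bucket's `sample` hits.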
Using multiple fields in a facet (won't work). Example: https://found.no/play/gist/1aa44e2114975384a7c2. Here we lose the relation between the different fields.
Building funny facets: https://found.no/play/gist/8124810
For example, building a category tree using these 3 "solutions" sucks. Solution 1 may work (ES 1 isn't stable right now). Solution 2 doesn't work. Solution 3 is painful, because it feels ugly: you need to prepare a lot of data, and the facets blow up.
An alternative could be to not store any category data in ES, just the id: https://found.no/play/gist/a53e46c91e2bf077f2e1
Then you could get the associated category from another system, like Redis, Memcached, or the database.
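That lookup could be sketched roughly like this. The `cache` object stands in for a Redis/Memcached client (a real client would batch-read with something like `mget` in one round trip), and `load_from_db` is a hypothetical loader for cache misses; both are assumptions, not part of the question:

```python
# Sketch (assumption): ES stores only category ids; metadata lives in a
# key-value store. `cache` is a dict-like stand-in for redis/memcached,
# and `load_from_db` is a hypothetical fallback loader.

def resolve_categories(category_ids, cache, load_from_db):
    """Resolve category metadata: read the cache first, then fall back
    to the database only for the ids that were missing."""
    resolved = {cid: cache.get(cid) for cid in category_ids}
    misses = [cid for cid, meta in resolved.items() if meta is None]
    if misses:
        for cid, meta in load_from_db(misses).items():
            cache[cid] = meta  # warm the cache for the next request
            resolved[cid] = meta
    return resolved
```

Batching both the cache read and the database fallback is what keeps the "load 1k categories" case from turning into 1k round trips.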
This would end up in clean code, but performance could become a problem. For example, loading 1k categories from Memcached / Redis / a database could be slow. Another problem is that syncing 2 databases is harder than syncing one.
How do you handle these problems?
I am sorry for the links, but I can't post more than 2 in one article.
Answer
The aggregations API allows grouping by multiple fields, using sub-aggregations. Suppose you want to group by fields field1, field2 and field3:
{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }
        }
      }
    }
  }
}
Of course this can go on for as many fields as you'd like.
Update:
For completeness, here is how the output of the above query looks. Also below is Python code for generating the aggregation query and flattening the result into a list of dictionaries.
{
  "aggregations": {
    "agg1": {
      "buckets": [
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    ...
                  ]
                }
              },
              ...
            ]
          }
        },
        ...
      ]
    }
  }
}
The following Python code performs the group-by given the list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch, thanks to https://github.com/elastic/elasticsearch/pull/11042).
def group_by(es, fields, include_missing):
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        buckets.append(agg_result[current_field + '_missing'])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)
    return result
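To illustrate the shape of the rows this produces without a live cluster, here is a simplified, self-contained restatement of the flattening step (no missing-bucket handling) run against a canned two-level aggregation result; the field names `color` and `size` are made up for the example:

```python
# Simplified sketch of the flattening done by get_docs_from_agg_result above:
# recursively turn nested buckets into flat rows (no missing-bucket handling).

def flatten(agg_result, fields):
    field, rest = fields[0], fields[1:]
    rows = []
    for bucket in agg_result[field]['buckets']:
        if not rest:
            rows.append({field: bucket['key'], 'doc_count': bucket['doc_count']})
        else:
            # flatten the inner level, then stamp this level's key onto each row
            for row in flatten(bucket, rest):
                row[field] = bucket['key']
                rows.append(row)
    return rows

# Canned result for a two-level group-by on hypothetical fields color, size
canned = {'color': {'buckets': [
    {'key': 'red', 'doc_count': 3, 'size': {'buckets': [
        {'key': 'S', 'doc_count': 1},
        {'key': 'M', 'doc_count': 2},
    ]}},
]}}

rows = flatten(canned, ['color', 'size'])
# rows == [{'color': 'red', 'size': 'S', 'doc_count': 1},
#          {'color': 'red', 'size': 'M', 'doc_count': 2}]
```

Each row is one combination of field values plus its document count, which is exactly the "group by multiple fields" result the question asks for.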