Dedup Elasticsearch results using multiple fields as unique key
Question
A similar question has been asked before (see Remove duplicate documents from a search in Elasticsearch), but I haven't found a way to dedup using multiple fields as the "unique key". Here's a simple example to illustrate what I'm looking for:
Say this is our raw data:
{ "name": "X", "event": "A", "time": 1 }
{ "name": "X", "event": "B", "time": 2 }
{ "name": "X", "event": "B", "time": 3 }
{ "name": "Y", "event": "A", "time": 4 }
{ "name": "Y", "event": "C", "time": 5 }
I would essentially like to get the distinct event counts based on name and event. I want to avoid double counting the event B which happened on the same name X twice, so the counts I'd be looking for are:
event: A, count: 2
event: B, count: 1
event: C, count: 1
Is there a way to set up an agg query as seen in the related question? Another option I've considered is to index each object with a special key field (e.g. "X_A", "X_B", etc.) and then simply dedup on that field. I'm not sure which approach is preferred, but I'd personally rather not index the data with extra metadata.
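For reference, the special-key option mentioned above could look roughly like the following Python sketch. It only illustrates the idea: the helper `with_dedup_key` and the field name `dedup_key` are hypothetical, and the documents are plain dictionaries that are never sent to Elasticsearch.

```python
def with_dedup_key(doc, fields=("name", "event"), sep="_"):
    """Return a copy of doc with a combined key field.

    The field name "dedup_key" is a made-up name for illustration.
    """
    enriched = dict(doc)  # copy so the raw document is left untouched
    enriched["dedup_key"] = sep.join(str(doc[f]) for f in fields)
    return enriched


# The sample data from the question.
raw_docs = [
    {"name": "X", "event": "A", "time": 1},
    {"name": "X", "event": "B", "time": 2},
    {"name": "X", "event": "B", "time": 3},
    {"name": "Y", "event": "A", "time": 4},
    {"name": "Y", "event": "C", "time": 5},
]

keyed_docs = [with_dedup_key(d) for d in raw_docs]

# Deduplicating on the combined key leaves one entry per (name, event) pair.
unique_keys = sorted({d["dedup_key"] for d in keyed_docs})
print(unique_keys)  # ['X_A', 'X_B', 'Y_A', 'Y_C']
```

The trade-off is the one already described: every document carries an extra field that has to be kept in sync with `name` and `event`.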
Recommended answer
You can specify a script in a terms aggregation in order to build a key out of multiple fields:
POST /test/dedup/_search
{
  "aggs": {
    "dedup": {
      "terms": {
        "script": "[doc.name.value, doc.event.value].join('_')"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
This will basically provide the following results:
- X_A: 1
- X_B: 2
- Y_A: 1
- Y_C: 1
Note: There's only one event C in your sample data, so the count cannot be two unless I'm missing something.
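The buckets above are keyed by name and event together, so one more step is needed to reach the per-event counts asked for in the question: count how many distinct keys share each event. Below is a minimal client-side sketch in Python. It assumes the keys have the `name_event` form produced by the script and that names contain no underscore. The `buckets` list is written by hand from the results above, using the usual `key` / `doc_count` shape of terms buckets, rather than fetched from a cluster, and `distinct_event_counts` is a hypothetical helper.

```python
from collections import Counter


def distinct_event_counts(buckets, sep="_"):
    """Count, per event, how many distinct name/event keys exist.

    Each bucket contributes exactly 1 regardless of its doc_count,
    which is what removes the double counting of X_B.
    """
    counts = Counter()
    for bucket in buckets:
        # Split on the first separator only: "X_B" -> ("X", "B").
        _name, event = bucket["key"].split(sep, 1)
        counts[event] += 1
    return dict(counts)


# Hand-written stand-in for the "dedup" aggregation buckets.
buckets = [
    {"key": "X_A", "doc_count": 1},
    {"key": "X_B", "doc_count": 2},
    {"key": "Y_A", "doc_count": 1},
    {"key": "Y_C", "doc_count": 1},
]

print(distinct_event_counts(buckets))  # {'A': 2, 'B': 1, 'C': 1}
```

With the sample data this gives A: 2, B: 1, C: 1, matching the counts requested in the question.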