Dedup Elasticsearch results using multiple fields as unique key

Question

A similar question has been asked before (see Remove duplicate documents from a search in Elasticsearch), but I haven't found a way to dedup using multiple fields as the "unique key". Here's a simple example to illustrate what I'm looking for:

Say this is our raw data:

{ "name": "X", "event": "A", "time": 1 }
{ "name": "X", "event": "B", "time": 2 }
{ "name": "X", "event": "B", "time": 3 }
{ "name": "Y", "event": "A", "time": 4 }
{ "name": "Y", "event": "C", "time": 5 }

I would essentially like to get the distinct event counts based on name and event. I want to avoid double counting the event B which happened on the same name X twice, so the counts I'd be looking for are:

event: A, count: 2
event: B, count: 1
event: C, count: 1

Is there a way to set up an agg query as seen in the related question? Another option I've considered is to index each object with a special key field (e.g. "X_A", "X_B", etc.). I could then simply dedup on this field. I'm not sure which approach is preferred, but I'd personally prefer not to index the data with extra metadata.
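For reference, the key-field alternative mentioned above could look roughly like the following before indexing. This is only a sketch: the field name `dedup_key` and the helper `with_dedup_key` are made up for illustration and are not part of any Elasticsearch API.

```python
# Hypothetical helper: "dedup_key" is an assumed field name, not an Elasticsearch convention.
def with_dedup_key(doc, fields=("name", "event"), sep="_"):
    """Return a copy of doc with a composite key built from the given fields."""
    keyed = dict(doc)
    keyed["dedup_key"] = sep.join(str(doc[f]) for f in fields)
    return keyed


raw_docs = [
    {"name": "X", "event": "A", "time": 1},
    {"name": "X", "event": "B", "time": 2},
    {"name": "X", "event": "B", "time": 3},
    {"name": "Y", "event": "A", "time": 4},
    {"name": "Y", "event": "C", "time": 5},
]

# Each document would be indexed with the extra field, e.g. "X_A", "X_B", ...
keyed_docs = [with_dedup_key(d) for d in raw_docs]
```

The trade-off is the one described above: deduplication becomes a plain aggregation on one field, at the cost of storing extra metadata with every document.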

Recommended answer

You can specify a script in a terms aggregation in order to build a key out of multiple fields:

POST /test/dedup/_search
{
  "aggs": {
    "dedup": {
      "terms": {
        "script": "[doc.name.value, doc.event.value].join('_')"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

This will basically provide the following results:

  • X_A: 1
  • X_B: 2
  • Y_A: 1
  • Y_C: 1
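
To make the relationship between these buckets and the per-event counts from the question concrete, here is a minimal client-side sketch in plain Python. It does not talk to Elasticsearch; it only mimics the composite key (name and event joined by `_`) on the sample data, and the variable names are illustrative.

```python
from collections import Counter

docs = [
    {"name": "X", "event": "A", "time": 1},
    {"name": "X", "event": "B", "time": 2},
    {"name": "X", "event": "B", "time": 3},
    {"name": "Y", "event": "A", "time": 4},
    {"name": "Y", "event": "C", "time": 5},
]

# Same composite key the script builds: name and event joined by "_".
bucket_counts = Counter(f"{d['name']}_{d['event']}" for d in docs)

# Keep one entry per distinct (name, event) pair, then count pairs per event.
distinct_pairs = {(d["name"], d["event"]) for d in docs}
event_counts = Counter(event for _, event in distinct_pairs)

print(dict(bucket_counts))
print(dict(event_counts))
```

Each bucket corresponds to one distinct (name, event) pair, so counting buckets per event (rather than documents per bucket) yields the deduplicated event counts.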

Note: There's only one event C in your sample data, so the count cannot be two unless I'm missing something.
