Remove duplicate documents from a search in Elasticsearch


Question

I have an index containing many documents that share the same value for a given field, and I want to deduplicate search results on that field.

Aggregations only return counts, but I would like a list of documents.

My index:

  • Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
  • Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
  • Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
  • Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
  • Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
  • Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}

I want this result (deduplicated on the domain field):

  • Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
  • Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
  • Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}

Recommended answer

You could use field collapsing: group the results on the name field with a terms aggregation, and set the size of the nested top_hits aggregation to 1 so that each group returns a single document.

POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
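The request above groups on name and keeps whichever hit ranks first in each group. The question asks for the latest document per domain, so a variant of the same body might look like the sketch below. It is only a sketch under stated assumptions: `build_dedup_query` and `max_groups` are hypothetical names, `domain` is assumed to be indexed as a single unanalyzed term, and `date` is assumed to be mapped as a date so that sorting is chronological. `"size": 0` in the body plays the role of `search_type=count`, which later Elasticsearch versions no longer accept.

```python
import json


def build_dedup_query(group_field, sort_field=None, max_groups=10):
    """Build a terms + top_hits request body that keeps one document per group.

    Hypothetical helper: the aggregation names "dedup" and "dedup_docs"
    mirror the request shown in the answer.
    """
    top_hits = {"size": 1}
    if sort_field is not None:
        # Largest value first, so the single hit kept per group is the latest one
        # (assumes sort_field is mapped as a date or another sortable type).
        top_hits["sort"] = [{sort_field: {"order": "desc"}}]
    return {
        # Suppress the regular hits list; only the aggregation buckets come back.
        "size": 0,
        "aggs": {
            "dedup": {
                # The terms aggregation returns only the top buckets,
                # so max_groups caps how many distinct values are covered.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {"dedup_docs": {"top_hits": top_hits}},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_dedup_query("domain", sort_field="date"), indent=2))
```

Because the terms aggregation only returns a bounded number of buckets, this approach suits cases where the number of distinct values is modest; `max_groups` has to be at least that number for every group to appear.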

The POST request returns:

{
  "took" : 192,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "dedup" : {
      "buckets" : [ {
        "key" : "name1",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "1",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name1", date: "01-01-2014"}
            } ]
          }
        }
      }, {
        "key" : "name2",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "3",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name2", date: "01-03-2014"}
            } ]
          }
        }
      }, {
        "key" : "name3",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "5",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name3", date: "01-05-2014"}
            } ]
          }
        }
      } ]
    }
  }
}
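The documents come back nested inside the aggregation buckets rather than in the top-level hits list, so the client has to flatten them to get a plain list of documents. A minimal sketch of that step, assuming the response has already been parsed into a Python dict and uses the aggregation names `dedup` and `dedup_docs` from the request (`extract_deduped_docs` is a hypothetical helper, and the sample response is trimmed, illustrative data):

```python
def extract_deduped_docs(response, agg_name="dedup", hits_name="dedup_docs"):
    """Return the first _source of every bucket in a terms + top_hits response."""
    docs = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # skip buckets that carry no hit
            docs.append(hits[0]["_source"])
    return docs


if __name__ == "__main__":
    # Trimmed, illustrative response with one hit per bucket.
    sample = {
        "aggregations": {
            "dedup": {
                "buckets": [
                    {"key": "name1", "doc_count": 2, "dedup_docs": {"hits": {"hits": [
                        {"_id": "1", "_source": {"domain": "domain1.fr", "name": "name1"}}
                    ]}}},
                    {"key": "name2", "doc_count": 2, "dedup_docs": {"hits": {"hits": [
                        {"_id": "3", "_source": {"domain": "domain2.fr", "name": "name2"}}
                    ]}}},
                ]
            }
        }
    }
    for doc in extract_deduped_docs(sample):
        print(doc)
```

The result keeps the bucket order of the response, so any ordering requirement is best expressed in the aggregation itself rather than in the client.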
