Remove duplicate documents from a search in Elasticsearch

Published: 2017/7/20 22:16:17. Tags: elasticsearch, deduplication


Problem description

I have an index containing many documents that share the same value for a given field. I want to deduplicate the search results on this field.

Aggregations only give me counts per value. I would like a list of documents instead.

My index:

Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}

I want this result (deduplicated on the domain field):

Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
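
To pin down what the expected result means, here is a minimal Python sketch of the same deduplication done client-side: one document per distinct domain, keeping the most recent one. The helper names (parse_date, latest_per_field) are hypothetical, and the day-month-year reading of the dates is an assumption; with this sample data a month-day-year reading gives the same order.

```python
from datetime import datetime

# Sample documents copied from the question.
docs = [
    {"domain": "domain1.fr", "name": "name1", "date": "01-01-2014"},
    {"domain": "domain1.fr", "name": "name1", "date": "01-02-2014"},
    {"domain": "domain2.fr", "name": "name2", "date": "01-03-2014"},
    {"domain": "domain2.fr", "name": "name2", "date": "01-04-2014"},
    {"domain": "domain3.fr", "name": "name3", "date": "01-05-2014"},
    {"domain": "domain3.fr", "name": "name3", "date": "01-06-2014"},
]


def parse_date(value):
    # Assumption: dates are written day-month-year.
    return datetime.strptime(value, "%d-%m-%Y")


def latest_per_field(documents, field):
    """Keep one document per distinct value of `field`: the most recent one."""
    latest = {}
    for doc in documents:
        key = doc[field]
        if key not in latest or parse_date(doc["date"]) > parse_date(latest[key]["date"]):
            latest[key] = doc
    # Newest first, matching the order shown in the question.
    return sorted(latest.values(), key=lambda d: parse_date(d["date"]), reverse=True)


deduplicated = latest_per_field(docs, "domain")
for doc in deduplicated:
    print(doc)
```

This only illustrates the desired semantics; on a large index the grouping should happen inside Elasticsearch rather than in the client.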

Solution

You could use field collapsing: group the results on the name field with a terms aggregation, and set the size of the nested top_hits aggregator to 1 so that each group returns a single document.

POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
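
The request above groups on name and returns the top-scoring hit per group, whereas the question asks for the most recent document per domain. A possible variant, sketched in Python as a plain request-body builder, is shown here. The function name build_dedup_query and its parameters are hypothetical. The sketch assumes that domain is indexed as an exact (not analyzed) value and that date is mapped as a date field. It also sets "size": 0 in the body instead of passing search_type=count, since later Elasticsearch releases no longer accept that search type.

```python
import json


def build_dedup_query(group_field, sort_field=None, max_groups=10):
    """Build a terms + top_hits search body that keeps one hit per group."""
    top_hits = {"size": 1}
    if sort_field is not None:
        # Pick the newest document in each bucket instead of the top-scoring one.
        top_hits["sort"] = [{sort_field: {"order": "desc"}}]
    return {
        "size": 0,  # suppress the regular hits; only aggregations are returned
        "aggs": {
            "dedup": {
                # `size` caps how many distinct groups (buckets) come back.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {"dedup_docs": {"top_hits": top_hits}},
            }
        },
    }


body = build_dedup_query("domain", sort_field="date")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same _search request. Note that the terms aggregation only returns max_groups buckets, so that value has to be raised when the field has more distinct values than that.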

This returns:

{
  "took" : 192,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "dedup" : {
      "buckets" : [ {
        "key" : "name1",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "1",
              "_score" : 1.0,
              "_source" : {"domain" : "domain1.fr", "name" : "name1", "date" : "01-01-2014"}
            } ]
          }
        }
      }, {
        "key" : "name2",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "3",
              "_score" : 1.0,
              "_source" : {"domain" : "domain1.fr", "name" : "name2", "date" : "01-03-2014"}
            } ]
          }
        }
      }, {
        "key" : "name3",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "5",
              "_score" : 1.0,
              "_source" : {"domain" : "domain1.fr", "name" : "name3", "date" : "01-05-2014"}
            } ]
          }
        }
      } ]
    }
  }
}
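
Since the regular hits array is empty, the deduplicated documents have to be read out of the aggregation buckets. A minimal Python sketch of that step follows, assuming the response has already been parsed into a dict. The function name extract_deduplicated is hypothetical, and sample_response is a trimmed-down copy of the response above, not real server output.

```python
def extract_deduplicated(response, agg_name="dedup", hits_name="dedup_docs"):
    """Return the _source of the first top hit in every bucket."""
    documents = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # a bucket should always have a hit, but stay defensive
            documents.append(hits[0]["_source"])
    return documents


# Trimmed-down response with the same shape as the one shown above.
sample_response = {
    "hits": {"total": 6, "max_score": 0.0, "hits": []},
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "name1",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"total": 2, "hits": [
                        {"_id": "1", "_source": {"domain": "domain1.fr", "name": "name1", "date": "01-01-2014"}}
                    ]}},
                },
                {
                    "key": "name2",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"total": 2, "hits": [
                        {"_id": "3", "_source": {"domain": "domain1.fr", "name": "name2", "date": "01-03-2014"}}
                    ]}},
                },
            ]
        }
    },
}

documents = extract_deduplicated(sample_response)
for doc in documents:
    print(doc)
```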
