Remove duplicate documents from a search in Elasticsearch
Question
I have an index in which many documents share the same value for a given field. I want to deduplicate search results on that field.
Aggregations only give me counts. I would like a list of documents.
My index:
- Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
- Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
I want this result (deduplicated by the domain field):
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
Recommended answer
You could use field collapsing: group the results on the name field with a terms aggregation, and set the size of the nested top_hits aggregator to 1.
POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
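The request above groups on name and lets top_hits pick whichever document ranks first in each bucket, while the question asks for one document per domain, apparently the most recent one. A minimal sketch of that variant follows, written as a Python helper that only builds the request body. The function name build_dedup_query is hypothetical. It assumes that date is mapped as a date type (so that sorting is chronological) and that domain is indexed as a single unanalyzed term. It also uses "size": 0 in the body instead of search_type=count, which later Elasticsearch versions no longer accept.

```python
import json


def build_dedup_query(group_field, sort_field=None, order="desc"):
    """Build a search body that keeps one document per distinct value of group_field.

    Hypothetical helper for illustration; it only assembles a dict and does not
    talk to Elasticsearch.
    """
    top_hits = {"size": 1}
    if sort_field is not None:
        # Choose which document represents each bucket, e.g. the newest one.
        top_hits["sort"] = [{sort_field: {"order": order}}]
    return {
        "size": 0,  # no regular hits; only the aggregation result is needed
        "aggs": {
            "dedup": {
                "terms": {"field": group_field},
                "aggs": {"dedup_docs": {"top_hits": top_hits}},
            }
        },
    }


# One document per domain, newest first within each bucket.
body = build_dedup_query("domain", sort_field="date")
print(json.dumps(body, indent=2))
```

Note that a terms aggregation returns only the top 10 buckets unless its own size is raised, so an index with many distinct values would need a larger "size" inside "terms". The output shown next comes from the original request on the name field, not from this variant.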
The POST request above returns:
{
  "took" : 192,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "dedup" : {
      "buckets" : [ {
        "key" : "name1",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "1",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name1", date: "01-01-2014"}
            } ]
          }
        }
      }, {
        "key" : "name2",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "3",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name2", date: "01-03-2014"}
            } ]
          }
        }
      }, {
        "key" : "name3",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "5",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name3", date: "01-05-2014"}
            } ]
          }
        }
      } ]
    }
  }
}
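The deduplicated documents are nested inside the buckets, so getting the flat list the question asks for takes one more step on the client side. The following is a rough sketch of that step, assuming the response has already been parsed into a Python dict. The function name extract_dedup_docs is hypothetical, and the sample is a trimmed copy of the response above rather than real server output.

```python
def extract_dedup_docs(response, agg_name="dedup", hits_name="dedup_docs"):
    """Flatten a terms + top_hits aggregation response into a list of documents."""
    docs = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        # With top_hits size 1 there is a single hit per bucket,
        # but looping keeps the helper valid for larger sizes too.
        for hit in bucket[hits_name]["hits"]["hits"]:
            docs.append({"_id": hit["_id"], **hit["_source"]})
    return docs


# Trimmed sample shaped like the response above (two buckets only).
sample = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "name1",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"total": 2, "hits": [
                        {"_id": "1", "_source": {"domain": "domain1.fr",
                                                 "name": "name1",
                                                 "date": "01-01-2014"}}
                    ]}},
                },
                {
                    "key": "name2",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"total": 2, "hits": [
                        {"_id": "3", "_source": {"domain": "domain1.fr",
                                                 "name": "name2",
                                                 "date": "01-03-2014"}}
                    ]}},
                },
            ]
        }
    }
}

docs = extract_dedup_docs(sample)
for doc in docs:
    print(doc)
```

The names "dedup" and "dedup_docs" are the aggregation names from the request, so they need to change together with the request if it is renamed.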