Remove duplicate documents from a search in Elasticsearch
This article describes how to remove duplicate documents from a search in Elasticsearch.
Problem description
I have an index containing many documents with the same value in a given field, and I want to deduplicate on that field. Aggregations come back to me as counters, but I would like a list of documents.
My index:
- Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
- Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
I want this result (deduplicated on the domain field):
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
Solution
You could use field collapsing: group the results on the name field and set the size of the top_hits aggregator to 1.
POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs" : {
    "dedup" : {
      "terms" : {
        "field" : "name"
      },
      "aggs" : {
        "dedup_docs" : {
          "top_hits" : {
            "size" : 1
          }
        }
      }
    }
  }
}
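Note that without an explicit sort, top_hits returns the highest-scoring document per bucket, which here is the oldest one (Doc 1, 3, 5) rather than the most recent (Doc 2, 4, 6) asked for in the question. To get the latest document per group, and to group on the domain field as the question intends, a sort on the date field can be added inside top_hits. A sketch, assuming date is mapped as a date type:

```json
{
  "aggs" : {
    "dedup" : {
      "terms" : {
        "field" : "domain"
      },
      "aggs" : {
        "dedup_docs" : {
          "top_hits" : {
            "size" : 1,
            "sort" : [
              { "date" : { "order" : "desc" } }
            ]
          }
        }
      }
    }
  }
}
```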
This returns:
{
  "took" : 192,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "dedup" : {
      "buckets" : [ {
        "key" : "name1",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "1",
              "_score" : 1.0,
              "_source" : { "domain" : "domain1.fr", "name" : "name1", "date" : "01-01-2014" }
            } ]
          }
        }
      }, {
        "key" : "name2",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "3",
              "_score" : 1.0,
              "_source" : { "domain" : "domain2.fr", "name" : "name2", "date" : "01-03-2014" }
            } ]
          }
        }
      }, {
        "key" : "name3",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "5",
              "_score" : 1.0,
              "_source" : { "domain" : "domain3.fr", "name" : "name3", "date" : "01-05-2014" }
            } ]
          }
        }
      } ]
    }
  }
}
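Client-side, one deduplicated document per bucket can then be pulled out of a response shaped like the one above. A minimal Python sketch; the stubbed response dict below is hypothetical and only mirrors the structure shown:

```python
# Extract one deduplicated document per terms bucket from an
# aggregation response shaped like the one above.
def dedup_docs(response):
    docs = []
    for bucket in response["aggregations"]["dedup"]["buckets"]:
        hits = bucket["dedup_docs"]["hits"]["hits"]
        if hits:  # top_hits size is 1, so take the single hit
            docs.append(hits[0]["_source"])
    return docs

# Hypothetical stub mirroring the response structure above.
response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {"key": "name1", "doc_count": 2, "dedup_docs": {"hits": {"hits": [
                    {"_id": "1", "_source": {"domain": "domain1.fr", "name": "name1", "date": "01-01-2014"}}
                ]}}},
                {"key": "name2", "doc_count": 2, "dedup_docs": {"hits": {"hits": [
                    {"_id": "3", "_source": {"domain": "domain2.fr", "name": "name2", "date": "01-03-2014"}}
                ]}}},
            ]
        }
    }
}

print(dedup_docs(response))
```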