Elasticsearch delete duplicates
Some of the records in my index are duplicated. The duplicates can be identified by a numeric field, recordid.
Elasticsearch has delete-by-query. Can I use it to delete all but one of each set of duplicate records?
Or is there some other way to achieve this?
Yes, you can find duplicated documents with an aggregation query:
curl -XPOST 'http://localhost:9200/your_index/_search' -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'
Then delete the duplicated documents, preferably using a bulk request. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
NOTE: Aggregation queries can be very expensive and might crash your nodes (if your index is too large and the number of data nodes is too small).
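As a rough illustration of the bulk deletion step, the sketch below turns the response of the aggregation query above into a request body for the _bulk endpoint. It is not taken from es-deduplicator. The function name build_bulk_delete_body is hypothetical, the aggregation names are the ones used in the query above, and the code assumes a typeless index (Elasticsearch 7 or later) and that keeping the first hit of each bucket is acceptable.

```python
import json


def build_bulk_delete_body(search_response,
                           agg_name="duplicateCount",
                           hits_name="duplicateDocuments"):
    """Build an NDJSON _bulk body that deletes all but the first hit per bucket.

    Hypothetical helper: the aggregation names default to those used in the
    query above; adjust them if your query differs.
    """
    lines = []
    buckets = search_response["aggregations"][agg_name]["buckets"]
    for bucket in buckets:
        hits = bucket[hits_name]["hits"]["hits"]
        # Keep the first document of each duplicate group, delete the rest.
        for hit in hits[1:]:
            action = {"delete": {"_index": hit["_index"], "_id": hit["_id"]}}
            lines.append(json.dumps(action))
    # The bulk API expects one JSON object per line and a trailing newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Made-up response fragment with three documents sharing one recordid.
    sample = {
        "aggregations": {
            "duplicateCount": {
                "buckets": [
                    {
                        "key": 42,
                        "doc_count": 3,
                        "duplicateDocuments": {
                            "hits": {
                                "hits": [
                                    {"_index": "your_index", "_id": "a"},
                                    {"_index": "your_index", "_id": "b"},
                                    {"_index": "your_index", "_id": "c"},
                                ]
                            }
                        },
                    }
                ]
            }
        }
    }
    print(build_bulk_delete_body(sample), end="")
```

The resulting text would be sent as the body of a POST to _bulk with the Content-Type application/x-ndjson. Because the query above returns at most 10 recordid values and at most 10 documents per value, a single pass may not remove every duplicate; repeating the query and the deletion until no buckets come back is one way to handle that.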