Elasticsearch delete duplicates


Question

Some of the records in my index are duplicated. The duplicates can be identified by a numeric field, recordid.

Elasticsearch has delete-by-query. Can I use it to delete one of the duplicate records?

Or is there some other way to achieve this?

Solution

Yes, you can find duplicated documents with an aggregation query:

curl -XPOST 'http://localhost:9200/your_index/_search' -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'

Then delete the duplicated documents, preferably using a bulk request. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
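As a rough illustration of the bulk step, here is a minimal Python sketch that turns the response of the aggregation query into a newline-delimited body for the `_bulk` endpoint. It keeps the first hit in each bucket and deletes the rest. The function name `build_bulk_delete_body` and the sample response are hypothetical; the aggregation names `duplicateCount` and `duplicateDocuments` come from the query above. It assumes an Elasticsearch version where a delete action needs only `_index` and `_id`; older versions with mapping types may also require `_type`.

```python
import json


def build_bulk_delete_body(agg_response,
                           agg_name="duplicateCount",
                           hits_name="duplicateDocuments"):
    """Build an NDJSON bulk body deleting all but the first hit per bucket.

    Hypothetical helper: it only builds the request body and does not
    talk to Elasticsearch itself.
    """
    lines = []
    for bucket in agg_response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        # Keep the first document of each bucket, delete the others.
        for hit in hits[1:]:
            action = {"delete": {"_index": hit["_index"], "_id": hit["_id"]}}
            lines.append(json.dumps(action))
    if not lines:
        return ""
    # The bulk API expects one JSON object per line and a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up response fragment, shaped like the output of the query above.
    sample = {
        "aggregations": {
            "duplicateCount": {
                "buckets": [
                    {
                        "key": 42,
                        "doc_count": 3,
                        "duplicateDocuments": {
                            "hits": {
                                "hits": [
                                    {"_index": "your_index", "_id": "a"},
                                    {"_index": "your_index", "_id": "b"},
                                    {"_index": "your_index", "_id": "c"},
                                ]
                            }
                        },
                    }
                ]
            }
        }
    }
    print(build_bulk_delete_body(sample), end="")
```

The resulting text could be POSTed to `http://localhost:9200/_bulk`. Because the query returns at most 10 buckets and 10 hits per bucket, the search and delete steps would need to be repeated until no bucket with `min_doc_count` 2 remains.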

NOTE: Aggregation queries can be very expensive and might crash your nodes (if your index is too large and the number of data nodes is too small).
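One possible way to limit the cost of each request is to split the terms aggregation into partitions and run them one at a time. The sketch below is not part of the original answer. It assumes an Elasticsearch version whose terms aggregation accepts `include` with `partition` and `num_partitions`; the function name `partitioned_duplicate_queries` and the default `bucket_size` are hypothetical.

```python
def partitioned_duplicate_queries(field, num_partitions, bucket_size=1000):
    """Return one search body per partition of the field's values.

    Assumption: the cluster supports partition filtering on the terms
    aggregation. Each body looks only at a slice of the distinct values,
    so a single request builds fewer buckets.
    """
    queries = []
    for partition in range(num_partitions):
        queries.append({
            "size": 0,
            "aggs": {
                "duplicateCount": {
                    "terms": {
                        "field": field,
                        # Restrict this request to one slice of the values.
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "min_doc_count": 2,
                        "size": bucket_size,
                    },
                    "aggs": {
                        "duplicateDocuments": {"top_hits": {"size": 10}}
                    },
                }
            },
        })
    return queries


if __name__ == "__main__":
    bodies = partitioned_duplicate_queries("recordid", num_partitions=20)
    print(len(bodies))
```

Each body would be sent to `your_index/_search` in turn. A larger `num_partitions` means more requests but fewer buckets per request.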
