How can I delete duplicates in MongoDb?


Problem description

I have a large collection (~2.7 million documents) in MongoDB, and there are a lot of duplicates. I tried running ensureIndex({id:1}, {unique:true, dropDups:true}) on the collection. Mongo churns away at it for a while before giving up with the error "too many dups on index build with dropDups=true".

How can I add the index and get rid of the duplicates? Or the other way around, what's the best way to delete some dups so that mongo can successfully build the index?

For bonus points, why is there a limit to the number of dups that can be dropped?

Solution

For bonus points, why is there a limit to the number of dups that can be dropped?

MongoDB is likely doing this to defend itself. If you dropDups on the wrong field, you could hose the entire dataset and lock down the DB with delete operations (which are "as expensive" as writes).

How can I add the index and get rid of the duplicates?

So the first question is why are you creating a unique index on the id field?

MongoDB creates a default _id field that is automatically unique and indexed. By default MongoDB populates _id with an ObjectId; however, you can override this with whatever value you like. So if you have a ready set of ID values, you can use those.

If you cannot re-import the values, then copy them to a new collection while changing id into _id. You can then drop the old collection and rename the new one. (Note that you will get a bunch of "duplicate key" errors; make sure your code catches and ignores them.)
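
The copy step described above could look roughly like the sketch below. It is only an illustration, not a tested migration script: the function name copyWithIdAsKey is made up for this example, it assumes every document has an id field, and it assumes it runs in a shell such as mongosh where insertOne throws an error carrying code 11000 when a duplicate key is hit.

```javascript
// Hypothetical helper (not part of MongoDB): copies every document from
// `source` into `target`, using the old `id` value as the new `_id`.
// Assumes `target.insertOne` throws an error with `code === 11000`
// when the `_id` already exists (the duplicate key error).
function copyWithIdAsKey(source, target) {
  let copied = 0;
  let skipped = 0;

  source.find().forEach((doc) => {
    // Drop the old ObjectId and promote `id` to `_id`.
    const { _id, id, ...rest } = doc;
    try {
      target.insertOne({ _id: id, ...rest });
      copied += 1;
    } catch (err) {
      // 11000 is the duplicate key error code; anything else is a real failure.
      if (err.code !== 11000) throw err;
      skipped += 1;
    }
  });

  return { copied, skipped };
}
```

With hypothetical collection names, this might be called as copyWithIdAsKey(db.items, db.items_dedup), followed by db.items.drop() and db.items_dedup.renameCollection("items") once the counts look right. Only the first document seen for each id survives, so this fits only when it does not matter which duplicate is kept.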
