How to remove duplicate records in MongoDB by MapReduce?


Question

I have a very large collection in MongoDB and I want to remove the duplicate records from it. My first thought was to drop the index and rebuild it with dropDups. However, there is too much duplicated data for MongoDB to handle.

So I turned to MapReduce for help. Here is my current progress.

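// Map: emit one count per document, keyed by myid.
var m = function () {
    emit(this.myid, 1);
};

// Reduce: sum the counts for each myid.
var r = function (k, vals) {
    return Array.sum(vals);
};

// Write one { _id: <myid>, value: <count> } document per key to "myoutput".
var res = db.userList.mapReduce(m, r, { out: "myoutput" });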

The "myid" of every duplicated record is now stored in the "myoutput" collection, with a count greater than 1. However, I don't know how to remove the records from userList by referencing myoutput.myid. It is supposed to be something like this:

db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        db.userList.remove(xxxxxxxxx) // I don't know how to do so
})

By the way, using forEach this way seems like it would wipe all records with the same myid, but I only want to remove the duplicates. For example:

{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }

{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }

The final result should preserve only one record. Can someone help me with this?

Thanks. :)

Recommended answer

The cleanest approach is probably to write a client-side script that deletes the records:

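db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        // obj._id is the duplicated myid; fetch only the _id of each matching record.
        var cur = db.userList.find({ myid: obj._id }, {_id: 1});
        var first = true;
        while (cur.hasNext()) {
            var doc = cur.next();
            // Keep the first record and remove every later one.
            if (first) {first = false; continue;}
            db.userList.remove({ _id: doc._id });
        }
    });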

I have not tested this code, so always double-check it before running it against production data.
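Because the script above is untested, its keep-the-first, remove-the-rest logic can be checked without touching a database. The following is only a minimal sketch in plain JavaScript, assuming documents shaped like the example records in the question. `idsToRemove` is a hypothetical helper, not part of MongoDB, and plain strings stand in for ObjectId values.

```javascript
// Hypothetical helper (not a MongoDB API): given documents in cursor order,
// return the _id of every document after the first one for each myid value.
function idsToRemove(docs) {
    var seen = new Set();
    var remove = [];
    docs.forEach(function (doc) {
        if (seen.has(doc.myid)) {
            remove.push(doc._id);   // a later copy: mark it for removal
        } else {
            seen.add(doc.myid);     // first occurrence: keep it
        }
    });
    return remove;
}

// Sample data modeled on the question; the last record is made up
// to show that a non-duplicated myid is left alone.
var docs = [
    { _id: "4edc6773e206a55d1c0000d8", myid: 0 },
    { _id: "4edc6780e206a55e6100011a", myid: 0 },
    { _id: "4edc6784e206a55ed30000c1", myid: 0 },
    { _id: "made-up-id", myid: 1 }
];

var result = idsToRemove(docs);
console.log(result);
```

In the shell, each returned _id would then go to db.userList.remove({ _id: id }), as in the script above. Which copy counts as "first" depends on the order in which the cursor returns documents, so sort explicitly if a particular copy must survive.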
