How to remove duplicate records in MongoDB with MapReduce?


Question

I have a very large collection in MongoDB and I want to remove the duplicate records from it. My first thought was to drop the index and rebuild it with dropDups. However, there is too much duplicated data for MongoDB to handle.

So I turned to MapReduce for help. Here is my current progress.

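// Map: emit each document's myid with a count of 1.
var m = function () {
    emit(this.myid, 1);
};

// Reduce: sum the counts emitted for the same myid.
var r = function (k, vals) {
    return Array.sum(vals);
};

// Write one { _id: <myid>, value: <count> } document per myid to "myoutput".
var res = db.userList.mapReduce(m, r, { out: "myoutput" });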
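
To make the shape of the "myoutput" collection concrete, here is a minimal sketch in plain JavaScript that imitates the map and reduce steps on an in-memory array. It does not talk to MongoDB. The `docs` array and the `countByMyid` helper are hypothetical names introduced only for illustration.

```javascript
// Plain-JavaScript imitation of the map/reduce job (no MongoDB needed).
// `docs` is hypothetical sample data standing in for the userList collection.
const docs = [
  { _id: "a", myid: 0 },
  { _id: "b", myid: 0 },
  { _id: "c", myid: 0 },
  { _id: "d", myid: 1 },
];

function countByMyid(input) {
  // Map phase: emit (myid, 1) for each document, grouped by key.
  const emitted = new Map();
  for (const doc of input) {
    if (!emitted.has(doc.myid)) emitted.set(doc.myid, []);
    emitted.get(doc.myid).push(1);
  }
  // Reduce phase: sum the emitted values for each key.
  const output = [];
  for (const [key, vals] of emitted) {
    output.push({ _id: key, value: vals.reduce((a, b) => a + b, 0) });
  }
  return output;
}

const myoutput = countByMyid(docs);
// Same filter as find({value: {$gt: 1}}): keep only keys seen more than once.
const duplicated = myoutput.filter((row) => row.value > 1);
console.log(myoutput);
console.log(duplicated);
```

Note that the key ends up in `_id` and the count in `value`, so a duplicated myid is any output row whose `value` is greater than 1.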

The "myid" of every duplicated record is now stored in the "myoutput" collection, with a value greater than 1. However, I don't know how to remove the records from userList by referencing myoutput.myid. It is supposed to be something like this:

db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        db.userList.remove(xxxxxxxxx) // I don't know how to do so
})

By the way, using forEach this way seems like it would wipe all records with the same myid, but I only want to remove the duplicates. For example:

{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }
{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }

The final result should preserve only one record. Can someone help me with this?

Thanks. :)

Recommended answer

The cleanest approach is probably to write a client-side script that deletes the records:

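db.myoutput.find({ value: { $gt: 1 } }).forEach(function (obj) {
    // Fetch only the _id of every userList document sharing this myid.
    var cur = db.userList.find({ myid: obj._id }, { _id: 1 });
    var first = true;
    while (cur.hasNext()) {
        var doc = cur.next();
        // Keep the first document and remove every later copy.
        if (first) { first = false; continue; }
        db.userList.remove({ _id: doc._id });
    }
});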

I have not tested this code, so always double-check it before running it against production data.
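
Since the script is untested, one cautious option is a dry run that only lists which documents would be deleted. The sketch below is plain JavaScript over an in-memory array, not the answer's shell code. It applies the same keep-the-first, drop-the-rest rule. The `idsToDelete` helper, the `sample` array, and the fourth sample document are hypothetical and exist only for illustration.

```javascript
// Dry-run sketch: decide which _id values to delete without touching a database.
function idsToDelete(records) {
  const seen = new Set();
  const toDelete = [];
  for (const doc of records) {
    if (seen.has(doc.myid)) {
      toDelete.push(doc._id); // a later copy: mark for deletion
    } else {
      seen.add(doc.myid); // first occurrence: keep it
    }
  }
  return toDelete;
}

// The three documents from the question, plus one hypothetical unique record.
const sample = [
  { _id: "4edc6773e206a55d1c0000d8", myid: 0 },
  { _id: "4edc6780e206a55e6100011a", myid: 0 },
  { _id: "4edc6784e206a55ed30000c1", myid: 0 },
  { _id: "hypothetical-unique-doc", myid: 1 },
];

const planned = idsToDelete(sample);
console.log(planned);
```

Reviewing a list like `planned` before issuing any remove call makes it easier to confirm that exactly one record per myid survives.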
