Fast way to find duplicates on an indexed column in MongoDB


Question

I have a collection of md5 hashes in MongoDB and I'd like to find all duplicates. The md5 column is indexed. Do you know of a fast way to do that using map reduce? Or should I just iterate over all records and check for duplicates manually?

My current approach using map reduce iterates over the collection almost twice (assuming that there is a very small number of duplicates):

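// Count how many documents share each md5.
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    }
)

// A count greater than 1 means the md5 occurs more than once.
// `out` is assumed to be a database handle defined elsewhere.
db[res.result].find({value: {$gt: 1}}).forEach(
function (obj) {
    out.duplicates.insert(obj)
});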

Recommended answer

The easiest way to do it in one pass is to sort by md5 and then process appropriately.

Something like:

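var previous_md5;
// Walk the documents in md5 order so that equal hashes are adjacent.
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {

  // Same md5 as the previous document: upsert and bump its duplicate count.
  if(current.md5 === previous_md5){
    db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true);
  }

  previous_md5 = current.md5;

});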

That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, its occurrences will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, I'm dropping it into the duplicates collection (and using $inc to count the number of duplicates).
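
To see the comparison logic on its own, here is a minimal in-memory sketch in plain JavaScript. It is not part of the original answer: the function name findDuplicateCounts and the sample hash values are hypothetical, an array stands in for the sorted cursor, and a Map stands in for the duplicates collection.

```javascript
// In-memory sketch of the sort-then-compare-neighbours idea.
// `findDuplicateCounts` and the sample values are hypothetical, for illustration only.
function findDuplicateCounts(md5List) {
  const sorted = [...md5List].sort();   // stands in for .sort({ md5: 1 })
  const duplicates = new Map();         // stands in for the duplicates collection
  let previousMd5;

  for (const currentMd5 of sorted) {
    if (currentMd5 === previousMd5) {
      // Mirrors the upsert with $inc: one increment per extra occurrence.
      duplicates.set(currentMd5, (duplicates.get(currentMd5) || 0) + 1);
    }
    previousMd5 = currentMd5;
  }
  return duplicates;
}

const counts = findDuplicateCounts(["c3", "a1", "b2", "a1", "c3", "a1"]);
for (const [md5, count] of counts) {
  console.log(md5, count);   // a1 2, then c3 1
}
```

Note that the count is the number of extra occurrences, not the total: a hash that appears three times ends up with a count of 2, because the first occurrence has no matching predecessor.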

This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
