Removing duplicate records using MapReduce


Problem description

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like this (simplified):

[
  { "MlsId": "12345" },
  { "MlsId": "12345" },
  { "MlsId": "23456" },
  { "MlsId": "23456" },
  { "MlsId": "0" },
  { "MlsId": "0" },
  { "MlsId": "" },
  { "MlsId": "" }
]

A listing is a duplicate if its MlsId is not "" or "0" and another listing has the same MlsId. So in the example above, the 2nd and 4th records would need to be removed.

How would I find all duplicate listings and remove them? I started looking at MapReduce but couldn't find an example that fit my case.

Here is what I have so far, but it doesn't check whether the MlsId is "0" or "":

m = function () { 
    emit(this.MlsId, 1); 
} 

r = function (k, vals) { 
   return Array.sum(vals); 
} 

res = db.Listing.mapReduce(m,r); 
db[res.result].find({value: {$gt: 1}}); 
db[res.result].drop();

Recommended answer

I have not used MongoDB, but I have used MapReduce. I think you are on the right track with your map and reduce functions. To exclude the "0" and empty-string values, you can add a check in the map function itself, something like:

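m = function () {
  // MlsId is stored as a string in the sample data, so compare against
  // the placeholder strings "0" and "" and skip those listings.
  if (this.MlsId !== "0" && this.MlsId !== "") {
    emit(this.MlsId, 1);
  }
}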

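In MongoDB, the reduce function must return the reduced value for its key rather than call emit, so the reduce function from the question can stay as it is:

r = function (k, vals) {
  // vals holds every 1 emitted for this MlsId; the sum is the occurrence count.
  return Array.sum(vals);
}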
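To check what a filtering map and a summing reduce produce on the sample data from the question, here is a minimal sketch in plain JavaScript that runs without MongoDB. The helper `simulateMapReduce` and its `(doc, emit)` callback shape are hypothetical stand-ins for illustration only; they are not part of the MongoDB API, where the document is available as `this` and `emit` is a global.

```javascript
// Sample data from the question.
const listings = [
  { MlsId: "12345" }, { MlsId: "12345" },
  { MlsId: "23456" }, { MlsId: "23456" },
  { MlsId: "0" }, { MlsId: "0" },
  { MlsId: "" }, { MlsId: "" }
];

// Hypothetical helper (not a MongoDB function): runs a map/reduce pass in memory.
function simulateMapReduce(docs, mapFn, reduceFn) {
  const emitted = new Map(); // key -> array of emitted values
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  for (const doc of docs) mapFn(doc, emit);

  const result = {};
  for (const [key, values] of emitted) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

const counts = simulateMapReduce(
  listings,
  // Map: skip the placeholder ids "0" and "", emit 1 for everything else.
  (doc, emit) => {
    if (doc.MlsId !== "0" && doc.MlsId !== "") emit(doc.MlsId, 1);
  },
  // Reduce: sum the emitted 1s to get the occurrence count per MlsId.
  (key, values) => values.reduce((a, b) => a + b, 0)
);

// "12345" and "23456" each map to 2; "0" and "" never appear as keys.
console.log(counts);
```

Any key whose count is greater than 1 is an MlsId with duplicates, which is what the `{value: {$gt: 1}}` query in the question selects.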

After this, the output should be a set of key-value pairs in which the key is an MlsId and the value is the number of times that ID occurs. I am not sure about the drop() part. As you pointed out, it will most probably delete all MlsIds instead of removing only the duplicate ones. To get around this, maybe you can call drop() first and then re-create each MlsId once. Will that work for you?
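As an alternative to dropping and re-creating records, one possible approach is to keep the first listing seen for each MlsId and collect the `_id` of every later repeat. The following is a rough sketch in plain JavaScript, not a tested MongoDB procedure. The helper `findDuplicateIds` is hypothetical, and the numeric `_id` values are made up for illustration (the simplified sample in the question does not show them).

```javascript
// Hypothetical helper: returns the _id of every listing whose MlsId was
// already seen earlier, ignoring the placeholder values "" and "0".
function findDuplicateIds(docs) {
  const seen = new Set();
  const duplicateIds = [];
  for (const doc of docs) {
    if (doc.MlsId === "" || doc.MlsId === "0") continue; // never a duplicate
    if (seen.has(doc.MlsId)) {
      duplicateIds.push(doc._id); // a later copy: mark for removal
    } else {
      seen.add(doc.MlsId); // first occurrence: keep it
    }
  }
  return duplicateIds;
}

// Sample data from the question, with assumed _id values 1..8.
const sample = [
  { _id: 1, MlsId: "12345" }, { _id: 2, MlsId: "12345" },
  { _id: 3, MlsId: "23456" }, { _id: 4, MlsId: "23456" },
  { _id: 5, MlsId: "0" }, { _id: 6, MlsId: "0" },
  { _id: 7, MlsId: "" }, { _id: 8, MlsId: "" }
];

const toRemove = findDuplicateIds(sample);
console.log(toRemove); // [ 2, 4 ]

// Assumption: in the mongo shell the documents could come from
// db.Listing.find().toArray(), and the result could then be passed to
// db.Listing.remove({ _id: { $in: toRemove } });
```

On the sample this selects the 2nd and 4th records, matching the expected result stated in the question. Loading the whole collection into memory is only reasonable for small collections; for larger ones, the MapReduce counts could narrow the search to MlsIds with a value greater than 1 first.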
