Removing duplicate records using MapReduce

Question

I'm using MongoDB and need to remove duplicate records. I have a Listing collection that looks like this (simplified):

[
  { "MlsId": "12345" },
  { "MlsId": "12345" },
  { "MlsId": "23456" },
  { "MlsId": "23456" },
  { "MlsId": "0" },
  { "MlsId": "0" },
  { "MlsId": "" },
  { "MlsId": "" }
]

A listing is a duplicate if its MlsId is not "" or "0" and another listing has the same MlsId. So in the example above, the 2nd and 4th records would need to be removed.

How would I find all duplicate listings and remove them? I started looking at MapReduce but couldn't find an example that fit my case.

Here is what I have so far, but it doesn't check whether the MlsId is "0" or "":

m = function () { 
    emit(this.MlsId, 1); 
} 

r = function (k, vals) { 
   return Array.sum(vals); 
} 

res = db.Listing.mapReduce(m,r); 
db[res.result].find({value: {$gt: 1}}); 
db[res.result].drop();

Recommended answer

I have not used MongoDB, but I have used MapReduce. I think you are on the right track with the map and reduce functions. To exclude "0" and empty strings, you can add a check in the map function itself, something like:

m = function () {
  // Skip the placeholder values "0" and ""; only real MLS IDs are counted.
  if (this.MlsId != "0" && this.MlsId != "") {
    emit(this.MlsId, 1);
  }
}
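
The same rule can be tried outside MongoDB as a standalone predicate. This is only a sketch in plain JavaScript: isRealMlsId is a hypothetical helper name, and the strict comparison assumes MlsId is always stored as a string, as it is in the sample data.

```javascript
// Hypothetical helper mirroring the condition used in the map function.
// Assumption: MlsId is always a string, as in the question's sample data.
function isRealMlsId(mlsId) {
  return mlsId !== "0" && mlsId !== "";
}

// The MlsId values from the question's sample collection.
const sample = ["12345", "12345", "23456", "23456", "0", "0", "", ""];

// Only the values that the map function would emit remain.
const kept = sample.filter(isRealMlsId);
console.log(kept); // the two "12345" and two "23456" values
```

If MlsId could also be stored as a number, the predicate would need an extra case for numeric 0.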

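In MongoDB, reduce returns the reduced value for the key rather than emitting a key-value pair (emit is only available inside map), so the reduce function from the question can stay as it is:

r = function (k, vals) {
  // Sum the counts for key k (1s from map, or partial sums on a re-reduce).
  return Array.sum(vals);
}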

After this, the output should be a set of key-value pairs where the key is an MlsId and the value is the number of times that ID occurs. I am not sure about the drop() part. As written, db[res.result].drop() drops the collection that holds the map-reduce output, not the Listing collection, so it will not remove any listings. To get rid of the duplicates, maybe you can delete all listings that share a duplicated MlsId and then recreate each one once. Will that work for you?
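
The counts show which MlsId values are duplicated, but not which documents to delete. Below is a minimal sketch of that selection step in plain JavaScript, run on an in-memory copy of the sample data. It rests on several assumptions: every listing has a unique _id (the numeric values here are made up), the first listing seen for an MlsId is the one to keep (the question does not say which copy should survive), and findDuplicateIds is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the _id of every listing whose MlsId was
// already seen earlier in the input. "" and "0" never count as duplicates.
function findDuplicateIds(listings) {
  const seen = new Set();
  const duplicateIds = [];
  for (const doc of listings) {
    if (doc.MlsId === "" || doc.MlsId === "0") {
      continue; // placeholder value, never treated as a duplicate
    }
    if (seen.has(doc.MlsId)) {
      duplicateIds.push(doc._id); // later copy: mark for removal
    } else {
      seen.add(doc.MlsId); // first occurrence: keep it
    }
  }
  return duplicateIds;
}

// Sample data from the question, with made-up numeric _id values.
const listings = [
  { _id: 1, MlsId: "12345" },
  { _id: 2, MlsId: "12345" },
  { _id: 3, MlsId: "23456" },
  { _id: 4, MlsId: "23456" },
  { _id: 5, MlsId: "0" },
  { _id: 6, MlsId: "0" },
  { _id: 7, MlsId: "" },
  { _id: 8, MlsId: "" }
];

const duplicateIds = findDuplicateIds(listings);
console.log(duplicateIds); // the _ids of the 2nd and 4th records
```

In the mongo shell, the same loop could presumably run over the documents returned by db.Listing.find(), with the collected ids then passed to db.Listing.remove({ _id: { $in: duplicateIds } }). That last step is untested here and worth trying on a copy of the collection first.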
