Remove Duplicates on mongodb


Problem description

I would like to remove duplicates on robomongo; my version is 3.0.12, so I can't use dropDups:

{
    "_id" : ObjectId("id"),
    "Name" : "No One",
    "SituationDate" : "18-03-2017",
    "Situation" : "ACTIVE",
    "Region" : "13 REGION",
    "RegisterNumber" : "7649",
    "Version" : "20170517"
}

The RegisterNumber should be unique, so I would like to remove duplicates by RegisterNumber.

EDIT: I just discovered that people from different regions can have the same RegisterNumber... How can I remove only those that have both the same RegisterNumber and Region?
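To see why the grouping key matters after this edit, here is a plain JavaScript sketch (runs outside the shell; the sample documents are made up) that groups `_id` values by the compound key the same way the aggregation's `$group` stage does:

```javascript
// Sample documents: all share a RegisterNumber, but one differs by Region,
// so only two of them are true duplicates under the compound key.
const docs = [
  { _id: 1, RegisterNumber: "7649", Region: "13 REGION" },
  { _id: 2, RegisterNumber: "7649", Region: "13 REGION" },
  { _id: 3, RegisterNumber: "7649", Region: "01 REGION" },
];

// Group _ids by the compound key, mirroring
// { "$group": { "_id": { RegisterNumber, Region }, "ids": { "$push": "$_id" } } }
const groups = new Map();
for (const d of docs) {
  const key = JSON.stringify([d.RegisterNumber, d.Region]);
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(d._id);
}

// Keep only groups that matched more than once, i.e. actual duplicates,
// mirroring { "$match": { "count": { "$gt": 1 } } }
const duplicates = [...groups.values()].filter(ids => ids.length > 1);
console.log(duplicates); // [ [ 1, 2 ] ] -- _id 3 survives because its Region differs
```

Grouping on `RegisterNumber` alone would have flagged all three documents as one duplicate group; the compound key keeps the cross-region document safe.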

Solution: Here is the solution given by @Neil Lunn with small modifications. I tested it in a collection called TEST and it worked:

var bulk = db.getCollection('TEST').initializeOrderedBulkOp();
var count = 0;

db.getCollection('TEST').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('TEST').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

Recommended answer

If you are prepared to simply discard all the other duplicates, then you basically want to .aggregate() in order to collect the documents with the same RegisterNumber value and remove all documents other than the first match.

MongoDB 3.0.x lacks some of the modern helpers, but the basics still exist: .aggregate() returns a cursor for processing large result sets, and "bulk operations" are present for write performance:

var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

db.collection.aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": "$RegisterNumber",
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.collection.initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

In more modern releases (3.2 and above) it is preferred to use bulkWrite() instead. Note that this is a "client library" thing, as the same "bulk" methods shown above are actually called "under the hood":

var ops = [];

db.collection.aggregate([
  { "$group": {
    "_id": "$RegisterNumber",
    "ids": { "$push": "$id" },
    "count": { "$sum": 1 }      
  }},
  { "$match": { "count": { "$gt": 1 } } }
]).forEach( doc => {

  var keep = doc.ids.shift();

  ops = [
    ...ops,
    {
      "deleteMany": { "filter": { "_id": { "$in": doc.ids } } }
    }
  ];

  if (ops.length >= 500) {
    db.collection.bulkWrite(ops);
    ops = [];
  }
});

if (ops.length > 0)
  db.collection.bulkWrite(ops);
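The batch-flush pattern used in both scripts (write every 500 queued operations, then flush whatever remains) can be sketched without a database; `bulkWrite` here is a stand-in function, not the real driver call:

```javascript
// Simulate queuing 1203 delete operations and flushing in batches of 500,
// the same pattern the shell scripts above use around bulkWrite()/execute().
const BATCH_SIZE = 500;
const flushed = [];                 // record of batch sizes actually "written"
let ops = [];

function bulkWrite(batch) {         // stand-in for db.collection.bulkWrite
  flushed.push(batch.length);
}

for (let i = 0; i < 1203; i++) {
  ops.push({ deleteMany: { filter: { _id: { $in: [i] } } } });
  if (ops.length >= BATCH_SIZE) {
    bulkWrite(ops);                 // flush a full batch
    ops = [];                       // re-init the queue, as after execute()
  }
}
if (ops.length > 0) bulkWrite(ops); // flush the remainder (203 ops)

console.log(flushed); // [ 500, 500, 203 ]
```

Batching keeps round trips to the server bounded instead of sending one delete per duplicate group.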

So $group pulls everything together via the RegisterNumber value and collects the matching document _id values into an array, keeping a count of how many times this happens with $sum.

Then filter out any documents that only had a count of 1, since those are clearly not duplicates.

Passing to the loop, you remove the first occurrence of _id in the collected list for the key with .shift(), leaving only the other "duplicates" in the array.

These are passed to the "remove" operation with $in as a "list" of documents to match and remove.
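The keep-first/remove-rest step can be sketched in plain JavaScript (no database needed; the group documents below are made up to mimic what the aggregation emits):

```javascript
// Each element mimics one group document from the aggregation:
// the grouped key plus the array of matching _id values.
const groupDocs = [
  { _id: { RegisterNumber: "7649", Region: "13 REGION" }, ids: [101, 102, 103] },
  { _id: { RegisterNumber: "8000", Region: "05 REGION" }, ids: [201, 202] },
];

const deleteFilters = [];
for (const doc of groupDocs) {
  const keep = doc.ids.shift();              // first _id stays in the collection
  // Everything left in doc.ids is a duplicate; build the $in delete filter
  deleteFilters.push({ _id: { $in: doc.ids } });
}

// deleteFilters now holds one { $in: [...] } filter per duplicate group,
// e.g. the first covers _ids 102 and 103 (101 was kept)
console.log(deleteFilters);
```

Because `.shift()` mutates the array in place, the filter built afterwards can never match the document being kept.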

The process is generally the same if you need something more complex, such as merging details from the other duplicate documents; you just need more care if doing something like converting the case of the "unique key", and therefore actually removing the duplicates first before writing the changes to the document to be modified.

At any rate, the aggregation will highlight the documents that actually are "duplicates". The remaining processing logic is based on whatever you actually want to do with that information once you identify them.

