Remove Duplicates on mongodb


Question

I would like to remove duplicates in Robomongo. My version is 3.0.12, so I can't use dropDups:

{
    "_id" : ObjectId("id"),
    "Name" : "No One",
    "SituationDate" : "18-03-2017",
    "Situation" : "ACTIVE",
    "Region" : "13 REGION",
    "RegisterNumber" : "7649",
    "Version" : "20170517"
}

The RegisterNumber should be unique, so I would like to remove duplicates by RegisterNumber.

EDIT: I just discovered that people from different regions can have the same RegisterNumber. How can I remove only the documents where both RegisterNumber and Region are the same?

Solution: here is the solution given by @Neil Lunn, with small modifications. I tested it in a collection called TEST and it worked:

var bulk = db.getCollection('TEST').initializeOrderedBulkOp();
var count = 0;

db.getCollection('TEST').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('TEST').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

Answer

If you are prepared to simply discard all other duplicates, then you basically want to .aggregate() in order to collect the documents with the same RegisterNumber value and remove all documents other than the first match.

MongoDB 3.0.x lacks some of the modern helpers, but the basics are still there: .aggregate() returns a cursor for processing large result sets, and "bulk operations" exist for write performance:

var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

db.collection.aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": "$RegisterNumber",
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.collection.initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

In more modern releases (3.2 and above) it is preferred to use bulkWrite() instead. Note that this is a "client library" thing, since the same "bulk" methods shown above are actually called "under the hood":

var ops = [];

db.collection.aggregate([
  { "$group": {
    "_id": "$RegisterNumber",
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  { "$match": { "count": { "$gt": 1 } } }
]).forEach( doc => {

  var keep = doc.ids.shift();

  ops = [
    ...ops,
    {
      "deleteMany": { "filter": { "_id": { "$in": doc.ids } } }
    }
  ];

  if (ops.length >= 500) {
    db.collection.bulkWrite(ops);
    ops = [];
  }
});

if (ops.length > 0)
  db.collection.bulkWrite(ops);

So $group pulls everything together via the RegisterNumber value and collects the matching document _id values into an array. The count of how many times this happens is kept using $sum.

Then filter out any groups that only had a count of 1, since those are clearly not duplicates.

Passing to the loop, you remove the first occurrence of _id in the collected list for the key with .shift(), leaving only the other "duplicates" in the array.

These are passed to the "remove" operation with $in as a "list" of documents to match and remove.

The process is generally the same if you need something more complex, such as merging details from the other duplicate documents; you just might need more care if doing something like converting the case of the "unique key", and therefore actually remove the duplicates first, before writing the changes to the document to be modified.

At any rate, the aggregation will highlight the documents that actually are "duplicates". The remaining processing logic is based on whatever you actually want to do with that information once you identify them.
