How to find duplicate records based on an id and a datetime field in MongoDB?


Problem description

I have a MongoDB collection with millions of records. Sample records are shown below:

[
  {
    _id: ObjectId("609977b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:12:02Z"
  },
  {
    _id: ObjectId("739177b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:15:02Z"
  },
  {
    _id: ObjectId("805577b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:18:02Z"
  }
]

Multiple records can have the same activityId; in that case I want only the record with the largest updateDateTime.

I tried the following pipeline. It works fine on a smaller collection but times out on a large one.

[
  {
    $lookup: {
      from: "MY_TABLE",
      let: {
        existing_date: "$updateDateTime",
        existing_sensorActivityId: "$activityId"
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$activityId", "$$existing_sensorActivityId"] },
                { $gt: ["$updateDateTime", "$$existing_date"] }
              ]
            }
          }
        }
      ],
      as: "matched_records"
    }
  },
  { $match: { "matched_records.0": { $exists: true } } },
  { $project: { _id: 1 } }
]

This gives me the _ids of all records that share an activityId with another record but have a smaller updateDateTime.

The slowness occurs at this step -> "matched_records.0": {$exists:true}

Is there a way to speed up this step, or is there another approach to this problem?

Solution

Instead of finding the duplicate documents and deleting them, you can find the unique documents and write the result to a new collection using $out.

How to find unique documents?

  • $sort by updateDateTime in descending order
  • $group by activityId and take the first root document ($$ROOT) of each group
  • $replaceRoot to promote that document back to the root
  • $out to write the query result to a new collection
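
To make the effect of the first three steps concrete, here is a minimal in-memory sketch in plain JavaScript. It is only an illustration, not how MongoDB executes the stages. The helper name latestPerActivity and the fourth record (activityId "987654321") are hypothetical additions; the other three records are the samples from the question, with ObjectId values written as plain strings. It relies on updateDateTime being stored as ISO 8601 UTC strings in one fixed format, so string comparison matches chronological order.

```javascript
// Sample records from the question, plus one hypothetical record with a different activityId.
const records = [
  { _id: "609977b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:12:02Z" },
  { _id: "739177b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:15:02Z" },
  { _id: "805577b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:18:02Z" },
  { _id: "hypothetical-id", activityId: "987654321", updateDateTime: "2021-03-25T08:00:00Z" }, // hypothetical
];

// Hypothetical helper that mimics $sort (descending) + $group with $first + $replaceRoot.
function latestPerActivity(docs) {
  // Step 1: sort a copy by updateDateTime, newest first.
  const sorted = [...docs].sort((a, b) => {
    if (a.updateDateTime < b.updateDateTime) return 1;
    if (a.updateDateTime > b.updateDateTime) return -1;
    return 0;
  });

  // Step 2: keep the first (newest) document seen for each activityId.
  const firstPerGroup = new Map();
  for (const doc of sorted) {
    if (!firstPerGroup.has(doc.activityId)) {
      firstPerGroup.set(doc.activityId, doc);
    }
  }

  // Step 3: return the kept documents themselves, not a wrapper around them.
  return [...firstPerGroup.values()];
}

const latest = latestPerActivity(records);
console.log(latest);
```

For activityId "123456789" only the 20:18:02Z record survives. The aggregation pipeline that follows expresses the same steps for MongoDB itself and adds $out to store the result.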

[
  { $sort: { updateDateTime: -1 } },
  {
    $group: {
      _id: "$activityId",
      record: { $first: "$$ROOT" }
    }
  },
  { $replaceRoot: { newRoot: "$record" } },
  { $out: "newCollectionName" } // set new collection name
]
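
On a collection with millions of documents, the initial $sort may exceed the in-memory limit of a blocking sort, so it can help to pass the allowDiskUse option when running the pipeline. The sketch below shows one way to do that from mongosh. The helper names buildLatestPerActivityPipeline and writeLatestPerActivity are hypothetical, and the collection name MY_TABLE is taken from the question; treat this as a sketch under those assumptions rather than a tuned solution.

```javascript
// Hypothetical helper: builds the pipeline from the answer for a given output collection name.
function buildLatestPerActivityPipeline(outputCollection) {
  return [
    { $sort: { updateDateTime: -1 } },
    { $group: { _id: "$activityId", record: { $first: "$$ROOT" } } },
    { $replaceRoot: { newRoot: "$record" } },
    { $out: outputCollection },
  ];
}

// Hypothetical helper: expects a mongosh-style collection handle such as db.MY_TABLE.
function writeLatestPerActivity(collection, outputCollection) {
  // allowDiskUse lets stages such as $sort spill to temporary files
  // instead of failing when they run out of memory.
  return collection.aggregate(
    buildLatestPerActivityPipeline(outputCollection),
    { allowDiskUse: true }
  );
}

// In mongosh, assuming the collection is named MY_TABLE as in the question:
// writeLatestPerActivity(db.MY_TABLE, "newCollectionName");
```

Note that $out replaces the target collection if it already exists, so the output name should not be one that holds data you still need.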

