How to find duplicate records based on an id and a datetime field in MongoDB?
Problem description
I have a MongoDB collection with millions of records. Sample records are shown below:
[
  {
    _id: ObjectId("609977b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:12:02Z"
  },
  {
    _id: ObjectId("739177b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:15:02Z"
  },
  {
    _id: ObjectId("805577b0e8e1c615cb551bf5"),
    activityId: "123456789",
    updateDateTime: "2021-03-24T20:18:02Z"
  }
]
Multiple records can have the same activityId; in that case I want only the record with the largest updateDateTime.
I tried the following aggregation. It works fine on a smaller collection but times out on a large one:
[
  {
    $lookup: {
      from: "MY_TABLE",
      let: {
        existing_date: "$updateDateTime",
        existing_sensorActivityId: "$activityId"
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$activityId", "$$existing_sensorActivityId"] },
                { $gt: ["$updateDateTime", "$$existing_date"] }
              ]
            }
          }
        }
      ],
      as: "matched_records"
    }
  },
  { $match: { "matched_records.0": { $exists: true } } },
  { $project: { _id: 1 } }
]
This gives me the _id values of all records that share an activityId with another record but have a smaller updateDateTime.
The slowness occurs at this step -> "matched_records.0": { $exists: true }
Is there a way to speed up this step, or is there another approach to this problem?
Instead of finding the duplicate documents and deleting them, you can find the unique documents and write the result to a new collection using $out.
How to find unique documents?
- $sort by updateDateTime in descending order
- $group by activityId and take the first root record
- $replaceRoot to promote that record to the root
- $out to write the query result to a new collection
[
  { $sort: { updateDateTime: -1 } },
  {
    $group: {
      _id: "$activityId",
      record: { $first: "$$ROOT" }
    }
  },
  { $replaceRoot: { newRoot: "$record" } },
  { $out: "newCollectionName" } // set new collection name
]
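To see which record the pipeline above keeps, here is a minimal sketch in plain JavaScript that mimics the $sort, $group/$first and $replaceRoot stages on an in-memory copy of the sample records. It does not talk to MongoDB. The helper name latestPerActivity is made up for illustration, and the _id values are written as plain strings instead of ObjectId. It also assumes updateDateTime is stored as an ISO 8601 UTC string in one consistent format, as in the sample, so that string comparison matches chronological order.

```javascript
// Plain JavaScript sketch (no MongoDB driver involved).
// `latestPerActivity` is a hypothetical helper, not a MongoDB API.
function latestPerActivity(records) {
  // Mimics { $sort: { updateDateTime: -1 } }.
  // Assumption: same-format ISO 8601 UTC strings compare correctly as strings.
  const sorted = [...records].sort((a, b) => {
    if (a.updateDateTime < b.updateDateTime) return 1;
    if (a.updateDateTime > b.updateDateTime) return -1;
    return 0;
  });

  // Mimics $group by activityId with { $first: "$$ROOT" }:
  // keep only the first record seen for each activityId.
  const firstByActivity = new Map();
  for (const record of sorted) {
    if (!firstByActivity.has(record.activityId)) {
      firstByActivity.set(record.activityId, record);
    }
  }

  // Mimics $replaceRoot: return the kept records themselves.
  return [...firstByActivity.values()];
}

// Sample records from the question, with _id simplified to a string.
const sample = [
  { _id: "609977b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:12:02Z" },
  { _id: "739177b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:15:02Z" },
  { _id: "805577b0e8e1c615cb551bf5", activityId: "123456789", updateDateTime: "2021-03-24T20:18:02Z" }
];

const kept = latestPerActivity(sample);
console.log(kept);
```

For the three sample records this keeps only the one with updateDateTime "2021-03-24T20:18:02Z", which is the record the question asks for.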