Remove duplicates from a MongoDB 4.2 database
Question
I am trying to remove duplicates from MongoDB, but every solution I have found fails. My JSON structure:
{
"_id" : ObjectId("5d94ad15667591cf569e6aa4"),
"a" : "aaa",
"b" : "bbb",
"c" : "ccc",
"d" : "ddd",
"key" : "057cea2fc37aabd4a59462d3fd28c93b"
}
The key value is md5(a+b+c+d). I already have a database with over 1 billion records. I want to remove all duplicates according to key and then create a unique index on it, so that a record whose key is already in the database will not be inserted again.
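For reference, a key of this shape could be computed as in the following Node.js sketch. The helper name makeKey is hypothetical and not part of the original post, and the sketch assumes the four fields are plain strings concatenated in a fixed order with no separator.

```javascript
// Node.js sketch; `makeKey` is a hypothetical helper introduced for illustration.
const nodeCrypto = require("crypto");

function makeKey(doc) {
  // Concatenate the four fields in a fixed order, then hash the result.
  const joined = doc.a + doc.b + doc.c + doc.d;
  return nodeCrypto.createHash("md5").update(joined, "utf8").digest("hex");
}

const key = makeKey({ a: "aaa", b: "bbb", c: "ccc", d: "ddd" });
console.log(key.length); // 32 (an MD5 hex digest is always 32 characters)
```

Because the fields are joined without a separator, different field values can produce the same concatenated string and therefore the same key (for example "ab" + "c" and "a" + "bc").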
I have already tried:
db.data.ensureIndex( { key:1 }, { unique:true, dropDups:true } )
But as far as I understand, the dropDups option was removed in MongoDB 3.0 and later.
I also tried several JavaScript snippets, such as:
var duplicates = [];
db.data.aggregate([
    { $match: {
        key: { "$ne": '' }  // discard selection criteria
    }},
    { $group: {
        _id: { key: "$key" },  // can be grouped on multiple properties
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $match: {
        count: { "$gt": 1 }  // Duplicates considered as count greater than one
    }}
],
{ allowDiskUse: true }  // For faster processing if set is larger
).forEach(function(doc) {
    doc.dups.shift();  // First element skipped for deleting
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);  // Getting all duplicate ids
    });
})
but it fails with:
QUERY [Js] uncaught exception: Error: command failed: {
"ok": 0,
"errmsg" : "assertion src/mongo/db/pipeline/value.cpp:1365".
"code" : 8,
"codeName" : "UnknownError"
} : aggregate failed
I haven't changed any MongoDB settings; I am working with the defaults.
Recommended answer
This is my input collection dups, with some duplicate data (k with values 11 and 22):
{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
{ "_id" : 4, "k" : 44 }
{ "_id" : 5, "k" : 55 }
{ "_id" : 6, "k" : 66 }
{ "_id" : 7, "k" : 22 }
{ "_id" : 8, "k" : 88 }
{ "_id" : 9, "k" : 11 }
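The following aggregation returns one document per distinct k value, so the duplicates are dropped from the result (the collection itself is not modified):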
db.dups.aggregate([
    { $group: {
        _id: "$k",
        dups: { "$addToSet": "$_id" },
        count: { "$sum": 1 }
    }},
    { $project: { k: "$_id", _id: { $arrayElemAt: [ "$dups", 0 ] } } }
])
=>
{ "k" : 88, "_id" : 8 }
{ "k" : 22, "_id" : 7 }
{ "k" : 44, "_id" : 4 }
{ "k" : 55, "_id" : 5 }
{ "k" : 66, "_id" : 6 }
{ "k" : 11, "_id" : 9 }
As you can see, the following duplicate documents were left out of the result:
{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
To get the result as an array:
var arr = db.dups.aggregate([ ...] ).toArray()
arr now contains the array of documents:
[
{
"k" : 88,
"_id" : 8
},
{
"k" : 22,
"_id" : 7
},
{
"k" : 44,
"_id" : 4
},
{
"k" : 55,
"_id" : 5
},
{
"k" : 66,
"_id" : 6
},
{
"k" : 11,
"_id" : 9
}
]
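The aggregation in this answer only reports which document to keep for each k; it does not delete anything. Below is a minimal sketch of how the remaining duplicates might then be deleted in batches and a unique index created afterwards. It is written as a plain JavaScript function that expects an object behaving like a mongo shell collection (aggregate, deleteMany, createIndex). The names removeDuplicates and batchSize are hypothetical, $push is used here in place of $addToSet since _id values are already unique, and the approach has not been tested against a collection with a billion documents.

```javascript
// Sketch only: `coll` is assumed to behave like a mongo shell collection.
// `removeDuplicates` and `batchSize` are hypothetical names for illustration.
function removeDuplicates(coll, field, batchSize) {
  batchSize = batchSize || 1000;
  var pending = [];  // _ids queued for deletion
  var deleted = 0;

  function flush() {
    if (pending.length === 0) return;
    deleted += coll.deleteMany({ _id: { $in: pending } }).deletedCount;
    pending = [];
  }

  coll.aggregate([
    { $group: { _id: "$" + field, ids: { $push: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }  // only groups that contain duplicates
  ], { allowDiskUse: true }).forEach(function (group) {
    // Keep the first _id of each group and queue the rest for deletion.
    group.ids.slice(1).forEach(function (id) {
      pending.push(id);
      if (pending.length >= batchSize) flush();
    });
  });
  flush();

  // With the duplicates gone, a unique index rejects future duplicate inserts.
  var spec = {};
  spec[field] = 1;
  coll.createIndex(spec, { unique: true });
  return deleted;
}
```

In the shell this could be called as removeDuplicates(db.dups, "k") for the sample collection, or removeDuplicates(db.data, "key") for the collection in the question. Which document of each group survives depends on the order in which $push collects the _id values, so add a $sort stage before $group if a specific one must be kept.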