pymongo: remove duplicates (map reduce?)


Question

I have a database with several collections (~15 million documents overall), and the documents look like this (simplified):

{'Text': 'blabla', 'ID': 101}
{'Text': 'Whuppppyyy', 'ID': 102}
{'Text': 'Abrakadabraaa', 'ID': 103}
{'Text': 'olalalaal', 'ID': 104}
{'Text': 'test1234545', 'ID': 104}
{'Text': 'whapwhapwhap', 'ID': 104}

They all have a unique _id field as well, but I want to delete duplicates according to another field (the external ID field).

First, I tried a very manual approach with lists and deleting afterwards, but the DB seems too big; it takes very long and is not practical.

Second, the following no longer works in current MongoDB versions (the dropDups option was removed in MongoDB 3.0), even though many people still suggest it:

db.collection.ensureIndex( { ID: 1 }, { unique: true, dropDups: true } )

So now I'm trying to create a map-reduce solution, but I don't really know what I'm doing, and I especially have difficulty using another field (not the database _id) to find and delete duplicates. Here is my bad first approach (adapted from some internet source):

from bson.code import Code

# emit the external "ID" field as the map-reduce key, 1 as the value
map = Code("function() { if (this.ID) { emit(this.ID, 1); } }")
reduce = Code("function(key, values) { return Array.sum(values); }")
res = coll.map_reduce(map, reduce, "my_results")

response = []
for doc in res.find():
    if doc['value'] > 1:
        # keep one document per ID, mark the rest for deletion
        count = int(doc['value']) - 1
        # the map-reduce key is stored in the result's _id field
        docs = coll.find({"ID": doc['_id']}, {'_id': 1}).limit(count)
        for i in docs:
            response.append(i['_id'])

coll.remove({"_id": {"$in": response}})
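As a cross-check, the counting that the map and reduce functions perform is just a per-ID tally. A minimal in-memory sketch (the sample documents are hypothetical; no MongoDB required):

```python
from collections import Counter

# Hypothetical in-memory stand-in for the collection's documents
docs = [
    {'Text': 'blabla', 'ID': 101},
    {'Text': 'Whuppppyyy', 'ID': 102},
    {'Text': 'olalalaal', 'ID': 104},
    {'Text': 'test1234545', 'ID': 104},
    {'Text': 'whapwhapwhap', 'ID': 104},
]

# emit(ID, 1) followed by Array.sum(values) amounts to a per-ID count
counts = Counter(d['ID'] for d in docs)

# IDs with a count above 1 are the ones with duplicates
duplicate_ids = [i for i, c in counts.items() if c > 1]
```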

Any help to remove duplicates in the external ID field (leaving one entry) would be very much appreciated ;) Thanks!

Answer

An alternative approach is to use the aggregation framework, which has better performance than map-reduce. In the following aggregation pipeline, the $group operator groups documents by the ID field and, using the $addToSet operator, stores each _id value of the grouped records in the unique_ids field. The $sum accumulator adds up the values passed to it, in this case the constant 1, thereby counting the number of grouped records into the count field. The next pipeline stage, $match, filters documents with a count of at least 2, i.e. duplicates.

Once you get the result from the aggregation, you iterate the cursor, remove the first _id in the unique_ids field, and push the rest into an array that is later used to remove the duplicates (leaving one entry):

cursor = coll.aggregate(
    [
        {"$group": {"_id": "$ID", "unique_ids": {"$addToSet": "$_id"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gte": 2}}}
    ]
)

response = []
for doc in cursor:
    del doc["unique_ids"][0]  # keep the first _id of each group
    for id in doc["unique_ids"]:
        response.append(id)

coll.remove({"_id": {"$in": response}})
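The cursor loop above amounts to a keep-first policy per ID. Here is a minimal sketch of that step over a hypothetical aggregation result (the "a"/"b"/"c" _id values are made up); note that in pymongo 3.0+ the deprecated remove() is replaced by delete_many():

```python
# Hypothetical document as returned by the $group/$match pipeline above
results = [{"_id": 104, "unique_ids": ["a", "b", "c"], "count": 3}]

response = []
for doc in results:
    keep, *extras = doc["unique_ids"]  # keep the first _id, mark the rest
    response.extend(extras)

# Against a live collection you would then run (pymongo 3.0+):
# coll.delete_many({"_id": {"$in": response}})
```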
