Sorting MongoDB collection deterministically (add new ObjectID field)

Question

I'm working on a MongoDB project which stores tweets and was created by someone else. This person decided to use the Twitter tweet ID for the _id field in MongoDB, which means I now have no way to sort the tweets deterministically.

Example:

> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(5)}

Sorting on the _id field is non-deterministic because, at a later date, my system could add the existing tweet with an ID of 4 to the database, so the same command would return a different result set:

> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(4)}
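
The effect can be reproduced without a database. The sketch below is a plain-Python illustration, not MongoDB code: `insertId` is a hypothetical field name, and a simple counter stands in for ObjectId. Real ObjectId values begin with a creation timestamp, so they sort roughly, not strictly, in insertion order.

```python
from itertools import count

# Hypothetical stand-in for ObjectId(): a counter that only ever increases,
# so later inserts always sort after earlier ones.
_next_insert_id = count(1)


def insert(collection, tweet_id):
    """Append a tweet, tagging it with an insertion-ordered key."""
    collection.append({"tweetId": tweet_id, "insertId": next(_next_insert_id)})


def first_n(collection, key, n):
    """Return the tweet IDs of the first n documents when sorted by `key`."""
    ordered = sorted(collection, key=lambda doc: doc[key])
    return [doc["tweetId"] for doc in ordered[:n]]


tweets = []
for tweet_id in (1, 2, 3, 5):
    insert(tweets, tweet_id)

before_by_tweet_id = first_n(tweets, "tweetId", 4)
before_by_insert_id = first_n(tweets, "insertId", 4)

# Tweet 4 is older on Twitter but reaches the database later.
insert(tweets, 4)

after_by_tweet_id = first_n(tweets, "tweetId", 4)    # the page changes
after_by_insert_id = first_n(tweets, "insertId", 4)  # the page is unchanged
```

Sorting by the tweet ID lets the late arrival push tweet 5 off the first page, while sorting by the insertion-ordered key leaves the first page as it was and places the new tweet at the end.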

My question is: is there a way to add a new field to every entry in a collection, with a value of type ObjectID, so that I can sort on that? If not, what would you recommend for renaming the _id field to, say, tweetId and then making the _id field of type ObjectID?

Thanks

Recommended answer

Some of the snippets in the post that Shawn linked to had several flaws. Whilst the idea was right, using the mongo command-line shell could cause several problems.

Getting a 'snapshot' of all the tweets before any new ones are added is difficult in mongo. The only way I could find to do it was to use:

> db.tweets.find({}, {_id: 1}).toArray()

or

> db.tweets.distinct('_id')

Unfortunately, as I had over 2 million tweets in my database, this caused mongo to run out of memory: I got an "exception: distinct too big, 16mb cap" error. Instead I used Python. Here's the script:

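#!/usr/bin/env python

"""A tool to work through all tweets and convert the '_id'
from the Tweet ID into an ObjectId, saving the original
tweet ID in the 'twitterId' field.
"""
import pymongo
from bson.objectid import ObjectId

if __name__ == "__main__":
    client = pymongo.MongoClient()
    db = client.guaiamum

    # Snapshot the IDs still stored as 64-bit integers (BSON type 18),
    # so documents that were already converted are skipped.
    ids = [t['_id'] for t in db.tweets.find({'_id': {'$type': 18}}, {'_id': 1})]
    for _id in ids:
        tweet = db.tweets.find_one({'_id': _id})
        if tweet is None:
            continue  # removed since the snapshot was taken
        tweet['_id'] = ObjectId()
        tweet['twitterId'] = _id
        # _id is immutable, so insert a copy under the new _id,
        # then delete the original document.
        db.tweets.insert_one(tweet)
        db.tweets.delete_one({'_id': _id})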

It still took a good 1.5 hrs to run, but bizarrely that's still much quicker than using the mongo shell.
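
For the other option raised in the question, keeping the tweet ID as _id and adding a separate sortable field, a minimal sketch might look like the following. The field name `sortId`, the helper `add_sort_ids`, and the `InMemoryCollection` stand-in are all hypothetical; the helper only assumes PyMongo-style `find()` and `update_one()` methods. Values are assigned in whatever order `find()` returns the documents, so the backfilled order is arbitrary but fixed from then on.

```python
from itertools import count


def add_sort_ids(collection, new_id, field="sortId"):
    """Give every document lacking `field` a freshly generated value.

    Returns the number of documents updated.
    """
    missing = {field: {"$exists": False}}
    # Collect the IDs first so the updates do not disturb the iteration.
    ids = [doc["_id"] for doc in collection.find(missing, {"_id": 1})]
    for _id in ids:
        collection.update_one({"_id": _id}, {"$set": {field: new_id()}})
    return len(ids)


class InMemoryCollection:
    """Tiny stand-in for a PyMongo collection, only for this demo."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query, projection=None):
        # Supports only the {field: {"$exists": False}} query used above.
        (field, _condition), = query.items()
        return [doc for doc in self.docs if field not in doc]

    def update_one(self, query, update):
        # Supports only {"_id": value} filters with a "$set" update.
        for doc in self.docs:
            if doc["_id"] == query["_id"]:
                doc.update(update["$set"])
                return


# Demo with a counter standing in for ObjectId.
counter = count(1)
demo = InMemoryCollection([{"_id": 5}, {"_id": 1}, {"_id": 3, "sortId": 0}])
updated = add_sort_ids(demo, lambda: next(counter))

# Against a real database this would be, assuming PyMongo is installed:
#   from bson.objectid import ObjectId
#   add_sort_ids(db.tweets, ObjectId)
#   db.tweets.create_index("sortId")
#   db.tweets.find().sort("sortId", 1)
```

This avoids the insert-then-delete step, at the cost that every later insert must also set the new field for the ordering to stay complete.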
