Sorting MongoDB collection deterministically (add new ObjectID field)
Problem description
I'm working on a MongoDB project which stores tweets and was created by someone else. This person decided to use the Twitter tweet ID for the _id field in MongoDB, which means I now have no way to sort the tweets deterministically.
Example:
> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(5)}
The reason sorting on the _id field is non-deterministic is that, at a later date, my system could add the existing tweet that has an ID of 4 to the database, meaning that the same command would give a different result set:
> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(4)}
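To make the problem concrete, here is a minimal Python sketch that uses plain dictionaries instead of a real collection. The `seq` field and the `add_tweet` and `first_n` helpers are hypothetical names used only for illustration. `seq` stands in for an insertion-ordered key; an ObjectId only approximates this, since it begins with a creation timestamp. Nothing here talks to MongoDB.

```python
from itertools import count

_seq = count(1)  # hypothetical insertion counter, standing in for an ObjectId


def add_tweet(collection, tweet_id):
    """Append a tweet, recording the order in which it was stored."""
    collection.append({"tweetId": tweet_id, "seq": next(_seq)})


def first_n(collection, key, n):
    """Return the tweet IDs of the first n tweets when sorted by `key`."""
    ordered = sorted(collection, key=lambda t: t[key])
    return [t["tweetId"] for t in ordered[:n]]


tweets = []
for tweet_id in (1, 2, 3, 5):
    add_tweet(tweets, tweet_id)

by_id_before = first_n(tweets, "tweetId", 4)   # [1, 2, 3, 5]
by_seq_before = first_n(tweets, "seq", 4)      # [1, 2, 3, 5]

add_tweet(tweets, 4)  # an older tweet is stored late

by_id_after = first_n(tweets, "tweetId", 4)    # [1, 2, 3, 4]
by_seq_after = first_n(tweets, "seq", 4)       # [1, 2, 3, 5]
```

Sorting by the tweet ID changes the first page once tweet 4 is stored, while sorting by the insertion-ordered key leaves it untouched, which is the behaviour I am after.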
My question is: is there a way to add a new 'field' to every entry in a collection, with a value of type ObjectID, so that I can sort on that?
Or if not, what would the recommendations be for 'renaming' the _id field to, say, tweetId and then making the _id field of type ObjectID?
Thanks
Recommended answer
Some of the snippets in the post that Shawn linked to had several flaws. Whilst the idea was right, using the command line mongo shell could cause several problems.
Getting a 'snapshot' of all the tweets before any new ones are added is difficult in mongo. The only way I could find to do it was to use:
> db.tweets.find({}, {_id : 1}).toArray()
or
> db.tweets.distinct('_id')
Unfortunately, as I had over 2 million tweets in my database, this caused mongo to run out of memory. I got an "exception: distinct too big, 16mb cap" error.
Instead I used Python; here's the script:
#!/usr/bin/env python
"""A tool to work through all tweets and convert the '_id'
from the Tweet ID into an ObjectId, saving the tweet
ID in the 'twitterId' field.
"""
import pymongo
from bson.objectid import ObjectId

if __name__ == "__main__":
    client = pymongo.MongoClient()
    db = client.guaiamum
    # Snapshot the _ids that are still 64-bit integers (BSON type 18).
    ids = [t['_id'] for t in db.tweets.find({'_id': {'$type': 18}}, {'_id': 1})]
    for _id in ids:
        tweet = db.tweets.find_one({'_id': _id})
        tweet['_id'] = ObjectId()
        tweet['twitterId'] = _id
        # insert_one/delete_one replace insert/remove, which PyMongo 4 dropped.
        db.tweets.insert_one(tweet)
        db.tweets.delete_one({'_id': _id})
It still took a good 1.5 hours to run, but bizarrely that's still much quicker than using mongo.
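The per-document step of the script can be isolated into a small function, so it can be checked without a running server. This is only a sketch: `rekey_tweet`, its `new_id` parameter, and `fake_object_id` are hypothetical names, and the fake generator stands in for `bson.objectid.ObjectId` so the example runs without PyMongo installed.

```python
import copy
from itertools import count


def rekey_tweet(tweet, new_id):
    """Return a copy of `tweet` with a freshly generated '_id'.

    The original '_id' (the Twitter tweet ID) is preserved in 'twitterId'.
    `new_id` is any zero-argument callable; with PyMongo it would be ObjectId.
    """
    rekeyed = copy.deepcopy(tweet)
    rekeyed["twitterId"] = tweet["_id"]
    rekeyed["_id"] = new_id()
    return rekeyed


_counter = count(1)


def fake_object_id():
    """Hypothetical stand-in for ObjectId: returns increasing string IDs."""
    return "oid-%d" % next(_counter)


original = {"_id": 4, "userId": 123, "message": "..."}
converted = rekey_tweet(original, fake_object_id)
```

In the real script the call would be `rekey_tweet(tweet, ObjectId)`, followed by inserting the result and deleting the document that still carries the old `_id`.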