How to index unknown fields of a known field in PyMongo?
Question
I am trying to find the unique words in millions of tweets, and I also want to keep track of where each word appears. In addition, I am grouping the words by their initial letter. Here is some sample code:
from pymongo import UpdateOne

# connect to db stuff
commands = []
for word in words:  # this is actually not the real loop I've used but it fits for this example
    # assume tweet_id and the position (beg, end) are calculated here
    initial = word[0]
    ret = {"tweet_id": tweet_id, "pos": (beg, end)}  # additional information about word
    command = UpdateOne({"initial": initial}, {"$inc": {"count": 1}, "$push": {"words.%s" % word: ret}}, upsert=True)
    commands.append(command)
    if len(commands) % 1000 == 0:
        # flush a batch of 1000 updates
        db.tweet_words.bulk_write(commands, ordered=False)
        commands = []
However, this is far too slow for analyzing all those tweets. I am guessing the problem is that I don't have an index on the words field.
Here is a sample document:
{
    initial: "t",
    count: 3,
    words: {
        "the": [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}],
        "turkish": [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]
    }
}
I've tried to create indexes using the following code (unsuccessfully):
db.tweet_words.create_index([("words.$**", pymongo.TEXT)])
or
db.tweet_words.create_index([("words", pymongo.HASHED)])
I got errors like "add index fails, too many indexes for twitter.tweet_words" or "key too large to index". Is there a way to do this with indexes? Or should I change my approach to the problem (maybe redesign the db)?
Recommended answer
For the words to be indexable, you need to keep your dynamic data (here, the words themselves) in the values of the objects, not in the keys. So I'd suggest you rework your schema to look like:
{
    initial: "t",
    count: 3,
    words: [
        {value: "the", tweets: [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}]},
        {value: "turkish", tweets: [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]}
    ]
}
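If documents in the old shape already exist, they would need to be converted. The following is a minimal sketch of that conversion in plain Python, not part of the original answer. The helper name `to_value_schema` and the integer tweet ids are made up for illustration; the field names `initial`, `count`, `words`, `value`, and `tweets` come from the two schemas shown above.

```python
def to_value_schema(doc):
    """Convert a document whose `words` field maps word -> occurrences
    into one whose `words` field is a list of {"value", "tweets"} entries.

    Hypothetical helper for illustration only.
    """
    return {
        "initial": doc["initial"],
        "count": doc["count"],
        "words": [
            # The word moves from a key into the `value` field,
            # so it becomes data that an index can cover.
            {"value": word, "tweets": list(occurrences)}
            for word, occurrences in doc["words"].items()
        ],
    }


# Example input in the old shape (tweet ids are placeholders).
old_doc = {
    "initial": "t",
    "count": 3,
    "words": {
        "the": [{"tweet_id": 1, "pos": (2, 5)}, {"tweet_id": 2, "pos": (9, 12)}],
        "turkish": [{"tweet_id": 1, "pos": (5, 11)}],
    },
}
new_doc = to_value_schema(old_doc)
```

The converted document keeps `initial` and `count` unchanged; only the `words` field changes from a mapping to a list.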
The reworked schema could then be indexed as:
db.tweet_words.create_index([("words.value", pymongo.TEXT)])
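With the words stored in an array, the `$push` on `words.<word>` from the question no longer applies, so the update has to change as well. The sketch below is one possible way to do it, not something the answer prescribes: it only builds the filter and update documents as plain dicts, and the helper name `build_updates` is hypothetical. It assumes the positional `$` operator is used to append to an existing word entry, with a fallback that adds a new entry when the word has not been seen yet.

```python
def build_updates(word, tweet_id, beg, end):
    """Return two (filter, update) pairs for the reworked schema.

    Hypothetical helper for illustration only.
    """
    occurrence = {"tweet_id": tweet_id, "pos": [beg, end]}
    initial = word[0]

    # Case 1: the word already has an entry in `words`.
    # The filter matches that array element, and `$` in the update
    # refers to the matched element.
    append_to_existing = (
        {"initial": initial, "words.value": word},
        {"$inc": {"count": 1}, "$push": {"words.$.tweets": occurrence}},
    )

    # Case 2: first occurrence of the word: push a whole new entry.
    # Intended to be sent with upsert=True so that the document for
    # this initial is created if it does not exist yet.
    add_new_word = (
        {"initial": initial},
        {"$inc": {"count": 1},
         "$push": {"words": {"value": word, "tweets": [occurrence]}}},
    )
    return append_to_existing, add_new_word


existing, new = build_updates("the", 42, 2, 5)
```

Against a real collection, one could first call `db.tweet_words.update_one(*existing)` and, only if the result's `matched_count` is 0, call `db.tweet_words.update_one(*new, upsert=True)`. That takes two round trips for a new word and is not safe against concurrent writers adding the same word, so treat it as a starting point rather than a drop-in replacement for the bulk write in the question.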