MongoDB: cannot use a cursor to iterate through all the data


Problem description



Update on update:

Solved! See this: MongoDB: cannot iterate through all data with cursor (because data is corrupted)

It was caused by a corrupted data set, not by MongoDB or the driver.

=========================================================================

I'm using the latest Java driver (2.11.3) with MongoDB 2.4.6. I've got a collection with ~250M records and I want to use a cursor to iterate through all of them. However, after 10 minutes or so, cursor.hasNext() either returns false or I get an exception saying that the cursor no longer exists on the server.

After that I learned about cursor timeouts and wrapped my cursor.next() in try/catch. If an exception occurs, or if hasNext() returns false before all records have been iterated, the program closes the cursor, allocates a new one, and uses skip() to jump right back to where it left off.
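For reference, here is a minimal sketch of that close-and-reopen-with-skip() loop using the legacy Java driver; the names (coll, handle, processed) are placeholders of mine, not the original program:

    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.MongoException;

    public class SkipRecovery {
        // Sketch of the "catch, reopen, skip() back" approach described above;
        // coll and handle() are placeholders, not the original code.
        static void iterateWithSkipRecovery(DBCollection coll) {
            long processed = 0;
            boolean finished = false;
            while (!finished) {
                DBCursor cursor = coll.find().skip((int) processed);
                try {
                    while (cursor.hasNext()) {
                        handle(cursor.next());     // process one document
                        processed++;
                    }
                    finished = true;               // reached the end; the cursor survived
                } catch (MongoException e) {
                    // Cursor timed out or vanished on the server: loop around,
                    // open a new cursor and skip() back to where we stopped.
                } finally {
                    cursor.close();
                }
            }
        }

        static void handle(DBObject doc) { /* application-specific processing */ }
    }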

But later I read about cursor.skip() performance issues... and sure enough, once the program reached ~20M records, the cursor.next() following cursor.skip() threw a java.util.NoSuchElementException. I believe that's because the skip operation itself timed out, which invalidated the cursor.

Yes I've read about skip() performance issues and cursor timeout issues... But now I think I'm in a dilemma where fixing one will break the other.

So, is there a way to gracefully iterate through all the data in a huge dataset?

@mnemosyn has already pointed out that I have to rely on range-based queries. But the problem is that I want to split all the data into 16 parts and process them on different machines, and the data is not uniformly distributed over any monotonic key space. If load balancing is desired, there must be a way to calculate how many keys fall within a particular range and to balance them. My goal is to partition the data into 16 parts, so I have to find the 16-quantiles of the keys (the quartiles of the quartiles, so to speak) and use them to split the data.

Is there a way to achieve this?
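For illustration only, here is one crude way those 15 boundary keys could be located with the legacy Java driver, one skip()-based seek per boundary. The collection handle and names are assumptions of mine, and each seek is itself expensive on 250M documents, so whether it finishes before timing out is exactly the open question; this is only meant to make the quantile idea concrete:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import org.bson.types.ObjectId;

    import java.util.ArrayList;
    import java.util.List;

    public class PartitionBoundaries {
        // Find (parts - 1) _id values that split the collection into roughly
        // equal ranges. Each boundary is located with one skip()-based seek
        // along the _id index; slow, but only executed a handful of times.
        static List<ObjectId> findBoundaries(DBCollection coll, int parts) {
            long total = coll.count();
            List<ObjectId> boundaries = new ArrayList<ObjectId>();
            for (int i = 1; i < parts; i++) {
                DBCursor seek = coll.find(new BasicDBObject(),            // no filter
                                          new BasicDBObject("_id", 1))    // project _id only
                                    .sort(new BasicDBObject("_id", 1))
                                    .skip((int) (total * i / parts))
                                    .limit(1);
                try {
                    if (seek.hasNext()) {
                        boundaries.add((ObjectId) seek.next().get("_id"));
                    }
                } finally {
                    seek.close();
                }
            }
            return boundaries;           // 15 keys => 16 ranges for parts = 16
        }
    }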

I do have some ideas for after the initial seek has obtained the partition boundary keys: if the new cursor times out again, I can simply record the latest tweetID and resume with a new range. However, the range query itself has to be fast enough, or I'll still hit timeouts. I'm not confident about this...

Update:

Problem solved! I didn't realise that I don't have to partition the data into contiguous chunks. A round-robin job dispatcher will do. See the comments on the accepted answer.

Solution

In general, yes. If you have a monotonic field, ideally an indexed field, you can simply walk along it. For instance, if you're using a field of type ObjectId as the primary key, or if you have a CreatedDate or similar, you can simply issue an $lt query, take a fixed number of elements, then query again with $lt set to the smallest _id or CreatedDate you encountered in the previous batch.
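A minimal sketch of that pattern with the legacy Java driver, assuming the default ObjectId _id; the method name and batch size are mine:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import org.bson.types.ObjectId;

    public class RangeWalk {
        // Walk the whole collection in fixed-size batches, newest _id first.
        // Every batch opens a fresh, short-lived cursor, so server-side cursor
        // timeouts are no longer an issue.
        static void walkByObjectId(DBCollection coll, int batchSize) {
            ObjectId lastSeen = null;                   // smallest _id of the previous batch
            while (true) {
                BasicDBObject query = new BasicDBObject();
                if (lastSeen != null) {
                    query.put("_id", new BasicDBObject("$lt", lastSeen));
                }
                DBCursor cursor = coll.find(query)
                                      .sort(new BasicDBObject("_id", -1))
                                      .limit(batchSize);
                int seen = 0;
                try {
                    while (cursor.hasNext()) {
                        DBObject doc = cursor.next();
                        lastSeen = (ObjectId) doc.get("_id");
                        // ... process doc ...
                        seen++;
                    }
                } finally {
                    cursor.close();
                }
                if (seen < batchSize) {
                    break;                              // last (partial) batch reached
                }
            }
        }
    }

Because every batch uses a fresh cursor that only lives for a single limited query, the server-side cursor timeout never comes into play.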

Be careful about strictly monotonic vs. non-strictly monotonic behavior: if the keys aren't strict, you may have to use $lte and then guard against processing the duplicates twice. Since the _id field is unique, ObjectIds are always strictly monotonic.
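A hedged sketch of the non-strict case, assuming a createdDate field (the field name and the Set-based dedupe are illustrative, and batchSize has to exceed the number of documents sharing any single date, otherwise the pagination cannot advance):

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;

    import java.util.Date;
    import java.util.HashSet;
    import java.util.Set;

    public class NonStrictRangeWalk {
        // Same batching idea, but on a non-strict key: query with $lte and skip
        // documents at the boundary value already handled in the previous batch.
        static void walkByCreatedDate(DBCollection coll, int batchSize) {
            Date boundary = null;                       // smallest createdDate processed so far
            Set<Object> idsAtBoundary = new HashSet<Object>();
            while (true) {
                BasicDBObject query = new BasicDBObject();
                if (boundary != null) {
                    query.put("createdDate", new BasicDBObject("$lte", boundary));
                }
                DBCursor cursor = coll.find(query)
                                      .sort(new BasicDBObject("createdDate", -1))
                                      .limit(batchSize);
                int returned = 0;
                int fresh = 0;                          // documents not seen before
                try {
                    while (cursor.hasNext()) {
                        DBObject doc = cursor.next();
                        returned++;
                        Date d = (Date) doc.get("createdDate");
                        if (d.equals(boundary) && idsAtBoundary.contains(doc.get("_id"))) {
                            continue;                   // duplicate from the previous batch
                        }
                        // ... process doc ...
                        if (!d.equals(boundary)) {      // moved on to a new (smaller) date
                            boundary = d;
                            idsAtBoundary.clear();
                        }
                        idsAtBoundary.add(doc.get("_id"));
                        fresh++;
                    }
                } finally {
                    cursor.close();
                }
                if (returned < batchSize || fresh == 0) {
                    break;  // end of data, or stuck on a run of duplicates longer than batchSize
                }
            }
        }
    }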

If you don't have such a key, things are a little more tricky. You can still iterate 'along the index' (whatever index, be it a name, a hash, a UUID, Guid, etc.). That works just as well, but it's hard to do snapshotting, because you never know whether the result you have just found was inserted before you started to traverse, or not. Also, when documents are inserted at the beginning of the traversal, those will be missed.
