Speed up large result set processing using rmongodb


Problem Description



I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website: cnub.org/rmongodb.ashx

count <- mongo.count(mongo, ns, query)
cursor <- mongo.find(mongo, ns, query)

# preallocate one vector per field, then fill them while walking the cursor
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
while (mongo.cursor.next(cursor)) {
    b <- mongo.cursor.value(cursor)
    name[i] <- mongo.bson.value(b, "name")
    age[i] <- mongo.bson.value(b, "age")
    i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))

This works fine for hundreds or thousands of results, but that while loop is VERY, VERY slow. Is there some way to speed this up? Maybe an opportunity for multiprocessing? Any suggestions would be appreciated. I'm averaging 1M per hour, and at this rate I'll need a week just to build the data frame.

EDIT: I've noticed that the more vectors there are in the while loop, the slower it gets. I'm now trying to loop separately for each vector (a rough sketch of that approach follows below). It still seems like a hack, though; there must be a better way.
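For what it's worth, the "one pass per vector" idea might look roughly like this. fetch_field is a hypothetical helper (not part of rmongodb), and it re-runs the query once per field, so it trades per-iteration overhead in the loop for extra passes over the collection:

# Hypothetical helper: re-run the query and fill a single preallocated
# vector for one field at a time (mongo, ns, query, count as above).
fetch_field <- function(mongo, ns, query, field, count, mode = "character") {
  out <- vector(mode, count)
  cursor <- mongo.find(mongo, ns, query)
  i <- 1L
  while (mongo.cursor.next(cursor)) {
    out[i] <- mongo.bson.value(mongo.cursor.value(cursor), field)
    i <- i + 1L
  }
  out
}

name <- fetch_field(mongo, ns, query, "name", count)
age  <- fetch_field(mongo, ns, query, "age", count, mode = "numeric")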

EDIT 2: I'm having some luck with data.table. It's still running, but it looks like it will finish the 12M documents (my current test set) in about 4 hours. That's progress, but far from ideal.

library(data.table)

dt <- data.table(uri=rep("NA",count),
                 time=rep(0,count),
                 action=rep("NA",count),
                 bytes=rep(0,count),
                 dur=rep(0,count))

i <- 1L
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  # set() assigns by reference, so each cell update avoids copying the table
  set(dt, i, 1L,  mongo.bson.value(b, "cache"))
  set(dt, i, 2L,  mongo.bson.value(b, "path"))
  set(dt, i, 3L,  mongo.bson.value(b, "time"))
  set(dt, i, 4L,  mongo.bson.value(b, "bytes"))
  set(dt, i, 5L,  mongo.bson.value(b, "elaps"))
  i <- i + 1L
}

Solution

You might want to try the mongo.find.exhaust option:

cursor <- mongo.find(mongo, ns, query, options=mongo.find.exhaust)

This would be the easiest fix, if it actually works for your use case.

However, the rmongodb driver seems to be missing some extra features that are available in other drivers. For example, the JavaScript driver has a Cursor.toArray method, which dumps all of the find results directly into an array. The R driver has a mongo.bson.to.list function, but a mongo.cursor.to.list is probably what you want. It's probably worth pinging the driver developer for advice; a rough idea of what such a helper could look like is sketched below.
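As a sketch only (cursor.to.list is a hypothetical helper, not part of the rmongodb API, built purely from functions the driver does expose), something like this would collect every document into one R list, which can then be collapsed in a single step:

# Hypothetical helper: walk the cursor once and keep each document as a list.
# Preallocating with the known count avoids repeatedly growing the result.
cursor.to.list <- function(cursor, count) {
  out <- vector("list", count)
  i <- 1L
  while (mongo.cursor.next(cursor)) {
    out[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
    i <- i + 1L
  }
  out
}

docs <- cursor.to.list(mongo.find(mongo, ns, query),
                       mongo.count(mongo, ns, query))

Since data.table is already in play, the list of documents could then be flattened with something like data.table::rbindlist(docs, fill = TRUE), though the per-document R overhead of the loop itself does not go away.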

A hacky solution could be to create a new collection whose documents are data "chunks" of 100,000 of the original documents each. Each of those chunk documents could then be read efficiently with mongo.bson.to.list. The chunked collection could be constructed using the mongo server's MapReduce functionality, and reading it back from R might look like the sketch below.
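Purely to illustrate the read side (the mydb.chunked namespace and the docs array field are made-up names here, and the chunked collection is assumed to already exist), fetching whole chunks instead of individual documents would look something like this:

# Read back an assumed pre-built chunked collection: each document holds a
# "docs" array of original documents, so one cursor step yields a whole chunk.
cursor <- mongo.find(mongo, "mydb.chunked")
chunks <- list()
while (mongo.cursor.next(cursor)) {
  chunk <- mongo.bson.to.list(mongo.cursor.value(cursor))
  chunks[[length(chunks) + 1L]] <- chunk$docs
}
# chunks is now a short list of large pieces rather than millions of tiny ones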
