Using plyr, doMC, and summarise() with a very big dataset?


Question

I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on running it frequently, so my thought was to use doMC and the .parallel=TRUE flag with plyr, like so (simplified a bit):

library(plyr)
library(doMC)   # multicore backend for plyr's .parallel option
registerDoMC()  # registers all available cores by default

df <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)

If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
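For reference, a minimal sketch of capping the worker count as described above (registerDoMC(cores=) and getDoParWorkers() are standard doMC/foreach calls):

library(doMC)
registerDoMC(cores = 2)  # cap the number of forked workers to limit memory use
getDoParWorkers()        # confirm how many workers are registered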

My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to using a single core:

library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)  # provides the big.matrix data structure

bm <- as.big.matrix(df)  # copy the data frame into a big.matrix
df <- mdply(bm, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)

This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestions.

UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multiprocessing is a moot point in this particular instance. However, for big data-folding tasks, I'll keep data.table in mind. I was able to replicate my folding task with it in a straightforward way.
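For illustration, a minimal sketch of the same grouped count written in data.table (the column names cat1/cat2 follow the question; .N is data.table's built-in per-group row counter):

library(data.table)

dt  <- as.data.table(df)                        # convert once; operations then work by reference
res <- dt[, .(count = .N), by = .(cat1, cat2)]  # grouped row count, no per-group data copies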

Answer

I don't think plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 rather than 2).

I can think of a few tips you could try:


  • Put your data into an array structure instead of a data.frame and use adply to do the summarizing. Arrays are much more efficient in terms of memory use and speed. I mean using normal matrices, not big.matrix.
  • Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure whether data.table supports parallel processing, but even without parallelization, data.table can be hundreds of times faster. See a blog post of mine comparing ave, ddply, and data.table for processing chunks of data; a benchmark sketch follows this list.
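As a rough illustration of that comparison, here is a sketch on synthetic data; the microbenchmark package, the 1e5-row size, and the 20-level factors are assumptions for the example, not from the original post:

library(plyr)
library(data.table)
library(microbenchmark)  # assumed available; any timing method would do

# synthetic stand-in for the real ~1.4m-row dataset
df <- data.frame(cat1 = sample(letters[1:20], 1e5, replace = TRUE),
                 cat2 = sample(LETTERS[1:20], 1e5, replace = TRUE))
dt <- as.data.table(df)

microbenchmark(
  plyr       = ddply(df, c("cat1", "cat2"), summarize, count = length(cat2)),
  data.table = dt[, .(count = .N), by = .(cat1, cat2)],
  times = 10
)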
