Fast alternative to split in R


Problem description

I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 columns. I'm splitting/partitioning by two columns, both of character type. It looks like there are ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code is matched. The resulting number of partitions is ~250K. Here is the split() line:

 system.time(pop_part <- split(pop, list(pop$ID, pop$code)))

The partitions will then be fed into parLapply() as follows:

library(parallel)                  # provides makeCluster(), parLapply(), detectCores()
cl <- makeCluster(detectCores())   # one worker per core
system.time(par_pop <- parLapply(cl, pop_part, func))
stopCluster(cl)

I've let the split() code alone run for almost an hour and it doesn't complete. I can split by ID alone, which takes ~10 minutes. Additionally, RStudio and the worker threads are consuming ~6GB of RAM.

The reason I know the resulting number of partitions is that I have equivalent code in Pentaho Data Integration (PDI) that runs in 30 seconds (for the entire program, not just the "split" code). I'm not hoping for that kind of performance from R, but for something that perhaps completes in 10-15 minutes worst case.

The main question: Is there a better alternative to split? I've also tried ddply() with .parallel = TRUE, but it also ran over an hour and never completed.

Answer

Split indices into pop, rather than splitting the data frame itself:

idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code))

split() itself is not slow, e.g.,

> system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
   user  system elapsed 
  1.056   0.000   1.058 

so if yours is slow, I guess there's some aspect of your data that slows things down, e.g., ID and code are both factors with many levels, so their complete interaction, rather than just the level combinations appearing in your data set, is calculated:

> length(split(1:10, list(factor(1:10), factor(10:1))))
[1] 100
> length(split(1:10, paste(letters[1:10], letters[1:10], sep="-")))
[1] 10

Or maybe you're running out of memory.
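
If unused level combinations are indeed the culprit, a minimal workaround sketch (assuming pop, ID, and code as above) is to form groups only for the ID/code pairs that actually occur, either with drop = TRUE or by pasting the two columns into a single key:

## only create groups for ID/code pairs that actually occur in pop
idx <- split(seq_len(nrow(pop)), list(pop$ID, pop$code), drop = TRUE)

## same idea with a single character key, so unused factor-level
## combinations are never formed
key <- paste(pop$ID, pop$code, sep = ".")
idx <- split(seq_len(nrow(pop)), key)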

Use mclapply rather than parLapply if you're using processes on a non-Windows machine (which I guess is the case since you ask for detectCores()).

## each element of idx is a vector of row indices; pop and func are passed through to FUN
par_pop <- mclapply(idx, function(i, pop, fun) fun(pop[i,]), pop, func)
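
On Windows, where fork-based mclapply() cannot use more than one core, a rough equivalent using the PSOCK cluster from the question might look like the sketch below (it assumes pop, idx, and func are defined at the top level as above):

library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl, c("pop", "func"))   # ship the data and the function to the workers
par_pop <- parLapply(cl, idx, function(i) func(pop[i, , drop = FALSE]))
stopCluster(cl)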

Conceptually it sounds like you're really aiming for pvec (distribute a vectorized calculation over processors) rather than mclapply (iterate over individual rows in your data frame).
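
For illustration only, a pvec() call looks roughly like the sketch below; it assumes the per-partition work can be rewritten as a vectorized function over a column of pop (both vec_func and pop$value are hypothetical stand-ins). Like mclapply(), pvec() relies on forking, so it is not available on Windows.

library(parallel)
## pvec() splits the vector across cores and applies the vectorized
## function to each chunk
res <- pvec(pop$value, vec_func, mc.cores = detectCores())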

Also, and really as the initial step, consider identifying the bottlenecks in func; the data is large but not that big, so perhaps parallel evaluation is not needed at all -- maybe you've written PDI code instead of R code? Pay attention to data types in the data frame, e.g., factor versus character. It's not unusual to see a 100x speed-up between poorly written and efficient R code, whereas parallel evaluation is at best proportional to the number of cores.
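
One way to do that (a minimal sketch, assuming func and the idx list from above) is to profile the function on a single partition with base R's Rprof():

## profile func on one partition to see where the time is spent
Rprof("func_profile.out")
res <- func(pop[idx[[1]], ])
Rprof(NULL)
summaryRprof("func_profile.out")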
