Combinatorial iterator like expand.grid


Question

Is there a fast way to iterate through combinations like those returned by expand.grid or CJ (data.table)? With enough combinations, the result gets too big to fit in memory. There is iproduct in the itertools2 library (a port of Python's itertools), but it is really slow (at least the way I'm using it, shown below). Are there other options?

Here is an example, where the idea is to apply a function to each combination of rows from two data.frames (previous post).

library(data.table)  # CJ
library(itertools2)  # iproduct iterator
library(doParallel)

## Dimensions of two data
dim1 <- 10
dim2 <- 100
df1 <- data.frame(a = 1:dim1, b = 1:dim1)
df2 <- data.frame(x= 1:dim2, y = 1:dim2, z = 1:dim2)

## function to apply to combinations
f <- function(...) sum(...)

## Too big to expand with bigger dimensions (ie, 1e6, 1e5) -> errors
## test <- expand.grid(seq.int(dim1), seq.int(dim2))
## test <- CJ(indx1 = seq.int(dim1), indx2 = seq.int(dim2))
## Error: cannot allocate vector of size 3.7 Gb

## Create an iterator over the cartesian product of the two dims
it <- iproduct(x=seq.int(dim1), y=seq.int(dim2))

## Setup the parallel backend
cl <- makeCluster(4)
registerDoParallel(cl)

## Run
res <- foreach(i=it, .combine=c, .packages=c("itertools2")) %dopar% {
  f(df1[i$x, ], df2[i$y, ])
}
stopCluster(cl)

## Expand.grid results (different ordering)
expgrid <- expand.grid(x=seq(dim1), y=seq(dim2))
test <- apply(expgrid, 1, function(i) f(df1[i[["x"]],], df2[i[["y"]],]))

all.equal(sort(test), sort(res))  # TRUE
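
One way to walk the product without ever building it is to decode a linear index into its pair on the fly. The following is a minimal base-R sketch of that idea, not a drop-in replacement for iproduct: `grid_pair` and `block_size` are hypothetical names introduced here, and the summation inside the loop is only a stand-in for real per-pair work.

```r
## Decode a 1-based linear index k into the (x, y) pair that
## expand.grid(x = seq_len(n1), y = seq_len(n2)) would hold in row k.
## x varies fastest, which matches expand.grid's ordering.
grid_pair <- function(k, n1) {
  k0 <- k - 1
  list(x = k0 %% n1 + 1, y = k0 %/% n1 + 1)
}

n1 <- 10
n2 <- 100

## Walk the product in fixed-size blocks of indices, so only one
## block of pairs exists in memory at any time.
block_size <- 250  # hypothetical tuning knob
starts <- seq(1, n1 * n2, by = block_size)
total <- 0
for (s in starts) {
  k <- s:min(s + block_size - 1, n1 * n2)
  p <- grid_pair(k, n1)
  total <- total + sum(p$x + p$y)  # stand-in for real per-pair work
}
```

Because each block is decoded as a vector, the per-pair overhead is much smaller than pulling one element at a time from an iterator. The blocks are independent, so they could also be handed to separate workers.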

Recommended answer

I think you'll get better performance if you give each of the workers a chunk of one of the data frames, have them each perform the computations, and then combine the results. This results in more efficient computation and reduced memory usage by the workers.

Here is an example that uses the isplitRows function from the itertools package:

library(doParallel)
library(itertools)  # isplitRows

## Same toy data as in the question
dim1 <- 10
dim2 <- 100
df1 <- data.frame(a = 1:dim1, b = 1:dim1)
df2 <- data.frame(x = 1:dim2, y = 1:dim2, z = 1:dim2)
f <- function(...) sum(...)

## One chunk of df2 per worker
nw <- 4
cl <- makeCluster(nw)
registerDoParallel(cl)

## Each worker expands only its own chunk of df2 against all of df1
res <- foreach(d2=isplitRows(df2, chunks=nw), .combine=c) %dopar% {
  expgrid <- expand.grid(x=seq(dim1), y=seq(nrow(d2)))
  apply(expgrid, 1, function(i) f(df1[i[["x"]],], d2[i[["y"]],]))
}
stopCluster(cl)

I split df2 because it has more rows, but you could choose either.
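
To check that chunking gives the same values as a single full grid, the same computation can be run sequentially in base R. This is only a sketch under stated assumptions: `chunk_rows` and `process_chunk` are hypothetical helpers introduced here (they are not part of itertools or foreach), and a plain `lapply` stands in for the parallel `foreach` loop.

```r
## Same toy data as above
dim1 <- 10
dim2 <- 100
df1 <- data.frame(a = 1:dim1, b = 1:dim1)
df2 <- data.frame(x = 1:dim2, y = 1:dim2, z = 1:dim2)
f <- function(...) sum(...)

## Hypothetical helper: split row indices 1..n into `chunks`
## contiguous groups
chunk_rows <- function(n, chunks) {
  split(seq_len(n), cut(seq_len(n), chunks, labels = FALSE))
}

## Hypothetical helper: expand one chunk of df2 against all of df1,
## so only a chunk-sized grid is ever held in memory
process_chunk <- function(rows) {
  d2 <- df2[rows, , drop = FALSE]
  grid <- expand.grid(x = seq_len(nrow(df1)), y = seq_len(nrow(d2)))
  apply(grid, 1, function(i) f(df1[i[["x"]], ], d2[i[["y"]], ]))
}

## Sequential stand-in for the foreach/%dopar% loop
res_chunked <- unlist(lapply(chunk_rows(nrow(df2), 4), process_chunk),
                      use.names = FALSE)
```

Because the chunks are contiguous blocks of df2 rows and x varies fastest within each chunk, the concatenated result should come out in the same order as a single expand.grid over both dimensions. Splitting df1 instead would still cover every pair, but in a different order.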
