vectorizing & parallelizing the disaggregation of a list
Question
Here's some code that generates a list of data.frames and then converts that original list into a new list, with each list element a list of the rows of each data frame.
E.g.
- l1 has length 10 and each element is a data.frame with 1000 rows.
- l2 is a list of length 1000 (nrow(l1[[k]])) and each element is a list of length 10 (length(l1)) containing row-vectors from the elements of l1.
l1 <- vector("list", length= 10)
set.seed(65L)
for (i in 1:10) {
  l1[[i]] <- data.frame(matrix(rnorm(10000), ncol=10))
}

l2 <- vector(mode="list", length= nrow(l1[[1]]))
for (i in 1:nrow(l1[[1]])) {
  l2[[i]] <- lapply(l1, function(l) return(unlist(l[i,])))
}
Edit To clarify how l1 relates to l2, here is language-agnostic code.
for (j in 1:length(l1)) {
  for (i in 1:nrow(l1[[1]])) { # where nrow(l1[[1]]) == nrow(l1[[k]]), k = 2,...,10
    l2[[i]][[j]] <- l1[[j]][i,]
  }
}
How do I speed up the creation of l2 via vectorization or parallelization? The problem I'm having is that parallel::parLapplyLB splits lists; however, I don't want to split the list l1. What I want to do is split the rows within each element of l1. An intermediate solution would vectorize my current approach by using some *apply function to replace the for-loop. This could obviously be extended to a parallel solution as well.
If I work this out myself before an acceptable solution is posted, I'll post the answer here.
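One way to split over row indices rather than over l1 itself is to hand parLapplyLB a vector of row numbers and rebuild each element of l2 on the workers. The following is only a sketch, assuming a PSOCK cluster is acceptable; the helper name par_rows is illustrative, not from the original post:

```r
library(parallel)

## Parallelize over row indices instead of over the elements of x:
## parLapplyLB load-balances the row numbers across workers, and each
## worker extracts that row from every data.frame in x.
par_rows <- function(x, cores = 2L) {
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl))
  parLapplyLB(cl, seq_len(nrow(x[[1L]])), function(i, x) {
    lapply(x, function(l) unlist(l[i, ]))
  }, x = x)
}
```

Dropping the cluster and calling lapply(seq_len(nrow(x[[1]])), ...) directly with the same inner function would give the intermediate *apply-only version mentioned above.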
Answer
I would break the structure completely and rebuild the second list via split. This approach needs much more memory than the original one, but at least for the given example it is >10x faster:
sgibb <- function(x) {
  ## get the lengths of all data.frames (equal to `sapply(x, ncol)`)
  n <- lengths(x)
  ## destroy the list structure
  y <- unlist(x, use.names = FALSE)
  ## generate row indices (stores the information which row the element in y
  ## belongs to)
  rowIndices <- unlist(lapply(n, rep.int, x=1L:nrow(x[[1L]])))
  ## split y first by rows
  ## and subsequently loop over these lists to split by columns
  lapply(split(y, rowIndices), split, f=rep.int(seq_along(n), n))
}
alex <- function(x) {
  l2 <- vector(mode="list", length= nrow(x[[1]]))
  for (i in 1:nrow(x[[1]])) {
    l2[[i]] <- lapply(x, function(l) return(unlist(l[i,])))
  }
  l2
}
## check.attributes is needed because the names differ
all.equal(alex(l1), sgibb(l1), check.attributes=FALSE)
library(rbenchmark)
benchmark(alex(l1), sgibb(l1), order = "relative", replications = 10)
# test replications elapsed relative user.self sys.self user.child sys.child
#2 sgibb(l1) 10 0.808 1.000 0.808 0 0 0
#1 alex(l1) 10 11.970 14.814 11.972 0 0 0
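To see the split-based rebuild in isolation, here is a self-contained toy run of the same technique; the names toy and rowIdx are illustrative:

```r
## Two tiny data.frames standing in for l1
toy <- list(A = data.frame(a = 1:3, b = 4:6),
            B = data.frame(c = 7:9, d = 10:12))
n <- lengths(toy)                    # columns per data.frame: c(A = 2, B = 2)
y <- unlist(toy, use.names = FALSE)  # flatten column-major: 1:12
## each value's row number within its data.frame
rowIdx <- unlist(lapply(n, rep.int, x = seq_len(nrow(toy[[1]]))))
## split by row, then split each row-group back into per-data.frame pieces
out <- lapply(split(y, rowIdx), split, f = rep.int(seq_along(n), n))
## out[[1]] now holds row 1 of both data.frames
```

The inner split's f vector (rep.int(seq_along(n), n)) is what reassigns each row-group's values back to their source data.frames, which is why the result has one sub-list per element of the input.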