vectorizing & parallelizing the disaggregation of a list


Question


Here's some code that generates a list of data.frames and then converts that original list into a new list with each list element a list of the rows of each data frame.


E.g.
- l1 has length 10 and each element is a data.frame with 1000 rows.
- l2 is a list of length 1000 (nrow(l1[[k]])) and each element is a list of length 10 (length(l1)) containing row vectors drawn from the elements of l1.

## build a list of 10 data.frames, each 1000 rows x 10 columns
l1 <- vector("list", length = 10)
set.seed(65L)
for (i in 1:10) {
  l1[[i]] <- data.frame(matrix(rnorm(10000), ncol = 10))
}

## invert the nesting: one list element per row, each holding the
## corresponding row of every data.frame as a named vector
l2 <- vector(mode = "list", length = nrow(l1[[1]]))
for (i in 1:nrow(l1[[1]])) {
  l2[[i]] <- lapply(l1, function(l) unlist(l[i, ]))
}


Edit To clarify how l1 relates to l2, here is language agnostic code.

for (j in 1:length(l1)) {
  for (i in 1:nrow(l1[[1]])) { # where nrow(l1[[1]]) == nrow(l1[[k]]), k = 2,...,10
    l2[[i]][[j]] <- l1[[j]][i,]
  }
}


How do I speed the creation of l2 up via vectorization or parallelization? The problem I'm having is that parallel::parLapplyLB splits lists; however, I don't want to split the list l1, what I want to do is split the rows within each element of l1. An intermediate solution would vectorize my current approach by using some *apply function to replace the for-loop. This could obviously be extended to a parallel solution as well.
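The intermediate *apply version mentioned above, and a parallel variant that distributes row indices rather than the elements of l1, might be sketched as follows (a sketch only: the 2-worker cluster size and the clusterExport() call are illustrative assumptions, not from the original post):

```r
library(parallel)

## rebuild l1 as in the question
set.seed(65L)
l1 <- replicate(10, data.frame(matrix(rnorm(10000), ncol = 10)),
                simplify = FALSE)
rows <- seq_len(nrow(l1[[1]]))

## serial: replace the for-loop with lapply over row indices
l2_serial <- lapply(rows, function(i) lapply(l1, function(l) unlist(l[i, ])))

## parallel: the worker list is now the row indices, not l1 itself,
## so the load balancer splits rows across workers
cl <- makeCluster(2L)
clusterExport(cl, "l1")
l2_parallel <- parLapply(cl, rows,
                         function(i) lapply(l1, function(l) unlist(l[i, ])))
stopCluster(cl)
```

Note that each worker needs its own copy of l1 (hence clusterExport), so for small per-row work the communication overhead can outweigh the parallel gain.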

If I solve this myself before an acceptable answer appears, I will post my solution here.

Answer


I would break the structure completely and rebuild the second list via split. This approach needs much more memory than the original one but at least for the given example it is >10x faster:

sgibb <- function(x) {
  ## get the lengths of all data.frames (equal to `sapply(x, ncol)`)
  n <- lengths(x)
  ## destroy the list structure
  y <- unlist(x, use.names = FALSE)
  ## generate row indices (stores the information which row the element in y
  ## belongs to)
  rowIndices <- unlist(lapply(n, rep.int, x=1L:nrow(x[[1L]])))
  ## split y first by rows
  ## and subsequently loop over these lists to split by columns
  lapply(split(y, rowIndices), split, f=rep.int(seq_along(n), n))
}
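To make the split() mechanics concrete, here is a small hypothetical trace on two 2-row, 2-column data.frames (variable names mirror those inside sgibb; the toy input is an assumption for illustration):

```r
## toy trace of sgibb()'s internals
x <- list(data.frame(a = 1:2, b = 3:4), data.frame(a = 5:6, b = 7:8))
n <- lengths(x)                    # ncol of each data.frame: c(2, 2)
y <- unlist(x, use.names = FALSE)  # 1 2 3 4 5 6 7 8 (column-major per frame)
rowIndices <- unlist(lapply(n, rep.int, x = 1:2))  # 1 2 1 2 1 2 1 2
byRow <- split(y, rowIndices)      # byRow[["1"]] is c(1, 3, 5, 7)
res <- lapply(byRow, split, f = rep.int(seq_along(n), n))
## res[["1"]][["1"]] is c(1, 3): row 1 of the first data.frame
```

The first split groups every value by its row index; the second split cuts each row group back into one vector per original data.frame.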

alex <- function(x) {
  l2 <- vector(mode="list", length= nrow(x[[1]]))
  for (i in 1:nrow(x[[1]])) {
    l2[[i]] <- lapply(x, function(l) return(unlist(l[i,])))
  }
  l2
}

## check.attributes is needed because the names differ
all.equal(alex(l1), sgibb(l1), check.attributes=FALSE)

library(rbenchmark)
benchmark(alex(l1), sgibb(l1), order = "relative", replications = 10)
#       test replications elapsed relative user.self sys.self user.child sys.child
#2 sgibb(l1)           10   0.808    1.000     0.808        0          0         0
#1  alex(l1)           10  11.970   14.814    11.972        0          0         0

