Python's xrange alternative for R OR how to loop over a large dataset lazily?

Problem description

The following example is based on a discussion about using expand.grid with large data. As you can see, it ends up with an error. I guess this is due to the number of possible combinations, which according to the mentioned page is 68.7 billion:

> v1 <-  c(1:8)
> v2 <-  c(1:8)
> v3 <-  c(1:8)
> v4 <-  c(1:8)
> v5 <-  c(1:8)
> v6 <-  c(1:8)
> v7 <-  c(1:8)
> v8 <-  c(1:8)
> v9 <-  c(1:8)
> v10 <- c(1:8)
> v11 <- c(1:8)
> v12 <- c(1:8)
> expand.grid(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12)
Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : 
  invalid 'times' value
In addition: Warning message:
In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  NAs introduced by coercion to integer range

Even with eight vectors it kills my CPU and/or RAM (> expand.grid(v1, v2, v3, v4, v5, v6, v7, v8)). Here I've found some improvements which suggest using outer or rep.int. Those solutions work with two vectors, so I have not been able to apply them to 12 vectors, but I guess the principle is the same: it creates a large matrix which resides in memory. I'm wondering if there is something like Python's xrange which evaluates lazily? Here I've found the delayedAssign function, but I guess this will not help, because the following is also mentioned:

Unfortunately, R evaluates lazy variables when they are pointed to by a data structure, even if their value is not needed at the time. This means that infinite data structures, one common application of laziness in Haskell, are not possible in R.
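
As a small illustration of the quoted limitation (the variable name big, the message, and the toy expression are only for demonstration), merely storing a delayedAssign promise in a data structure forces its evaluation:

delayedAssign("big", { message("evaluating now"); expand.grid(1:8, 1:8) })
# nothing has been computed yet at this point
lst <- list(big)   # storing 'big' in a list forces the promise
# "evaluating now" is printed here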

Is using nested loops the only solution to this problem?

PS: I don't have a specific problem, but suppose that, for some reason, you need to do some computation using a function which accepts 12 integer arguments. Also suppose that you need to run it over all combinations of those 12 integers and save the results to a file. Using 12 nested loops and saving the results to the file continuously will work (it will be slow, but it will not kill your RAM). Here is shown how you can use expand.grid and apply to replace two nested loops. The problem is that creating such a matrix from 12 vectors of length 8 using expand.grid has some disadvantages:

  1. generating such a matrix is slow
  2. such a large matrix consumes a lot of memory (68.7 billion rows and 12 columns)
  3. further iteration over this matrix using apply is also slow

So from my point of view the functional approach is much slower than a procedural solution. I'm just wondering whether it is possible to lazily create a large data structure which in theory does not fit into memory, and to iterate over it. That's all.
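
For illustration, here is a minimal sketch of that procedural approach, reduced to 3 nested loops with a placeholder function f and output file (both hypothetical); memory use stays flat because each result is written as soon as it is computed:

f <- function(...) sum(...)                      # placeholder computation
con <- file("out.csv", open = "wt")              # hypothetical output file
for (i in 1:8)
  for (j in 1:8)
    for (k in 1:8)
      writeLines(paste(i, j, k, f(i, j, k), sep = ","), con)
close(con)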

Answer

One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package that @BenBolker suggested (the pdf on writing extensions is here). Lacking something more formal, here is a poor-man's iterator, similar to expand.grid but manually advancing. (Note: this will suffice given that the computation on each iteration is "more expensive" than this function itself. This could really be improved, but "it works".)

This function returns a named list (with the provided factors) each time the returned function is called. It is lazy in that it does not expand the entire list of possibles; it is not lazy with the arguments themselves, which are 'consumed' immediately.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # one counter per factor; the first starts at 0 so that the first call
  # returns the first combination
  indices <- c(0, rep(1, length(dots) - 1))
  function() {
    indices[1] <<- indices[1] + 1
    # carry: whenever a counter exceeds its factor's size, reset it and
    # advance the next factor's counter
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n = 1)) return(FALSE)   # last factor rolled over: exhausted
      indices[rolls] <<- 1
      indices[ 1 + which(rolls) ] <<- indices[ 1 + which(rolls) ] + 1
    }
    # pick the current element of each factor, returned as a named list
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}

Sample usage:

nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
#   a  b  c
# 1 1 15 21
nxt()
#   a  b  c
# 1 2 15 21
nxt()
#   a  b  c
# 1 3 15 21
nxt()
#   a  b  c
# 1 1 16 21

## <yawn>

nxt()
#   a  b  c
# 1 3 16 22
nxt()
# [1] FALSE

NB: for brevity of display, I used as.data.frame(mapply(...)) for the example; it works either way, but if a named list works fine for you then the conversion to a data.frame isn't necessary.
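
For completeness, a hedged sketch of how the iterator might be consumed to exhaustion and the results streamed to a file; the function f and the file name are placeholders, not part of the answer above:

nxt <- lazyExpandGrid(a = 1:3, b = 15:16, c = 21:22)
f <- function(combo) sum(unlist(combo))          # placeholder computation
con <- file("results.csv", open = "wt")          # hypothetical output file
while (!identical(combo <- nxt(), FALSE)) {
  writeLines(paste(c(unlist(combo), f(combo)), collapse = ","), con)
}
close(con)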

Edit

Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  # cumulative products give each factor's "place value" in a mixed-radix number
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0
  function(index) {
    # no argument: advance the internal counter; otherwise seek to 'index'
    i <<- if (missing(index)) (i + 1L) else index
    # vector of indices: call this closure for each one and row-bind into a data.frame
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    # decode the scalar counter 'i' into one 1-based position per factor
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L),
             argnames)
  }
}

It works with no arguments (auto-incrementing the internal counter), one argument (seek to and set the internal counter), or a vector argument (seek to each, set the counter to the last, and return a data.frame).

This last use-case allows for sampling a subset of the design space:

set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
#   a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
#      a  b  c  d  e  f
# 2   69 61  7  7 49 92
# 21  72 28 55 40 62 29
# 3   88 32 53 46 18 65
# 4   88 33 31 89 66 74
# 5   57 75 31 93 70 66
# 6  100 86 79 42 78 46
# 7   55 41 25 73 47 94
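
Building on the seeking interface, one could also walk the full design space in fixed-size chunks without ever materialising it; a sketch, where the chunk size and the processing step are arbitrary assumptions:

nxt <- lazyExpandGrid(a = 1:8, b = 1:8, c = 1:8)
total <- 8^3                                   # total number of combinations
chunk <- 64                                    # arbitrary chunk size
for (start in seq(1, total, by = chunk)) {
  idx   <- start:min(start + chunk - 1, total)
  block <- nxt(idx)                            # data.frame with length(idx) rows
  # process or append 'block' to a file here, then let it be garbage-collected
}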

Thanks to alexis_laz for the improvements with cumprod, Map, and the index calculations!
