从嵌套列表中进行有效采样 [英] Efficient sampling from nested lists

查看:82
本文介绍了从嵌套列表中进行有效采样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 列表列表 ,其中包含data.frames,我想从其中 仅选择几行 .我可以在for循环中实现该功能,在该循环中,我将基于行数创建一个序列,并根据该序列仅选择行索引.

I have a list of lists, containing data.frames, from which I want to select only a few rows. I can achieve it in a for-loop, where I create a sequence based on the amount of rows and select only row indices according to that sequence.

但是,如果我有更深层的嵌套列表,它将不再起作用.我也确信,有更好的方法可以做到无循环.

But if I have deeper nested lists it doesn't work anymore. I am also sure, that there is a better way of doing that without a loop.

什么是从嵌套列表中采样的有效且通用的方法,这些嵌套列表的维度不同并且包含data.frames或矩阵?

## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
  list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

## Code to opimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) { 
  tmp <- do.call(rbind, crdOrig[[r]])
  filterInd <- seq(1,nrow(tmp), by = filterBy)
  FiltRef[[r]] <- tmp[filterInd,]
}
crdResult <- do.call(rbind, FiltRef)

# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = F))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)

如果列表包含多个data.frames (下面的数据).

The code above works also if a list contains several data.frames (data below).

## Dummy Data (Multiple DF)
crdOrig <- list(
  list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
       data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

但是,如果一个列表包含多个列表,则尝试将结果(FiltRef)绑定在一起会引发错误.

But if a list contains multiple lists, it throws an error trying to bind the result (FiltRef) together.

结果可以是具有2列(x,y)的data.frame-例如crdResult或像FiltRef这样的一维列表(来自第一个示例)

The result can be a data.frame with 2 columns (x,y) - like crdResult or a one dimensional list like FiltRef (from the first example)

## Dummy Data (Multiple Lists)
crdOrig <- list(
  list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
       list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)


+1,谢谢,感谢您的出色回答!它们都可以工作,并且每个人都需要学习很多东西.我将把这个给@ Gwang-Jin Kim,因为他的解决方案是最灵活,范围最广的,尽管它们都值得检查!


+1 and Thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim as his solution is the most flexible and extensive, although they all deserve to be checked!

推荐答案

flatten

Preparation and implementation of flatten

嗯,还有许多其他答案在原则上是相同的.

Well, there are many other answers which are in principle the same.

与此同时,我实现了对嵌套列表进行扁平化的有趣操作.

I meanwhile implemented for fun the flattening of nested lists.

因为我在Lisp中进行思考:

Since I am thinking in Lisp:

从Lisp中首先实现carcdr.

Implemented first car and cdr from lisp.

car <- function(l) {
  if(is.list(l)) {
    if (null(l)) {
      list()
    } else {
      l[[1]]
    }
  } else {
    error("Not a list.")
  }
}

cdr <- function(l) {
  if (is.list(l)) {
    if (null(l) || length(l) == 1) {
      list()
    } else {
      l[2:length(l)]
    }
  } else {
    error("Not a list.")
  }
}

一些谓词功能:

null <- function(l) length(l) == 0   
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`

# upon @Martin Morgan's comment removed other predicate functions
# thank you @Martin Morgan!
# instead using `is.data.frame()` and `is.list()`, since they are
# not only already there but also safer.

构建扁平化(对于数据框列表)有必要

Which are necessary to build flatten (for data frame lists)

flatten <- function(nested.list.construct) {
  # Implemented Lisp's flatten tail call recursively. (`..flatten()`)
  # Instead of (atom l) (is.df l).
  ..flatten <- function(l, acc.l) { 
    if (null(l)) {
      acc.l
    } else if (is.data.frame(l)) {   # originally one checks here for is.atom(l)
      acc.l[[length(acc.l) + 1]] <- l
      acc.l # kind of (list* l acc.l)
    } else {
      ..flatten(car(l), ..flatten(cdr(l), acc.l))
    }
  }
  ..flatten(nested.list.construct, list())
}

# an atom is in the widest sence a non-list object

此后,使用采样函数定义实际函数.

After this, the actual function is defined using a sampling function.

定义采样功能

# helper function
nrow <- function(df) dim(df)[1L]

# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
  # Randomly selects a fraction of the rows of a data frame
  nr <- nrow(df) 
  df[sample(nr, fraction * nr), , drop = FALSE]
}

实际的收集器功能(来自嵌套的数据框列表)

collect.df.samples <- function(df.list.construct, fraction = 1/10) {
  do.call(rbind, 
         lapply(flatten(df.list.construct), 
                function(df) sample.one.nth.of.rows(df, fraction)
               )
        )
}
# thanks for the improvement with `do.call(rbind, [list])` @Ryan!
# and the hint that `require(data.table)`
# `data.table::rbindlist([list])` would be even faster.

collect.df.samples首先将数据帧df.list.construct的嵌套列表构造展平为数据帧的展平列表.它将功能sample.one.nth.of.rows应用于列表中的每个元素(lapply).由此产生一个采样数据帧列表(包含分数-在这里是原始数据帧行的1/10).这些采样的数据帧在整个列表中rbind.返回结果数据帧.它由每个数据帧的采样行组成.

collect.df.samples first flattens the nested list construct of data frames df.list.construct to a flat list of data frames. It applies the function sample.one.nth.of.rows to each elements of the list (lapply). There by it produces a list of sampled data frames (which contain the fraction - here 1/10th of the original data frame rows). These sampled data frames are rbinded across the list. The resulting data frame is returned. It consists of the sampled rows of each of the data frames.

对示例进行测试

## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
  list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
       list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

collect.df.samples(crdOrig, fraction = 1/10)

重构以供以后修改

通过将collect.df.samples函数编写为:

# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)

# refactored:
collect.df.samples <- 
  function(df.list.construct, 
           df.sampler.fun = sample.10th.fraction) {
  do.call(rbind, 
          lapply(flatten(df.list.construct), df.sampler.fun))
}

可以使采样器功能可替换. (如果没有,则可以通过更改fraction参数来增强或减少从每个数据帧收集的行数.)

One can make the sampler function replace-able. (And if not: By changing the fraction parameter, one can enhance or reduce amount of rows collected from each data frame.)

在此定义中,采样器功能可以轻松交换

要选择数据帧中的每第n行(例如每10行),而不是随机采样, 你可以例如使用采样器功能:

For choosing every nth (e.g. every 10th) row in the data frame, instead of a random sampling, you could e.g. use the sampler function:

df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]

,然后在collect.df.samples中将其输入为df.sampler.fun =.然后,此功能将应用于嵌套的df列表对象中的每个数据帧,并收集到一个数据帧中.

and input it as df.sampler.fun = in collect.df.samples. Then, this function will be applied to every data frame in the nested df list object and collected to one data frame.

every.10th.rows <- function(df, nth = 10) {
  df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}

a.10th.of.all.rows <- function(df, fraction = 1/10) {
  sample.one.nth.of.rows(df, fraction)
}

collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)

这篇关于从嵌套列表中进行有效采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆