R: fast (conditional) subsetting where feasible

Published: 2020/10/15 19:05:28. Tags: r, data.table, subset

Question

I would like to subset the rows of my data

library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))

> head(dat)
   id        x        y        z
1:  1 109.3400 208.6732 308.7595
2:  2 101.6920 201.0989 310.1080
3:  3 119.4697 217.8550 313.9384
4:  4 111.4261 205.2945 317.3651
5:  5 100.4024 212.2826 305.1375
6:  6 114.4711 203.6988 319.4913

in several stages. I am aware that I could apply subset(.) sequentially to achieve this.

> s <- subset(dat, x>119)
> s <- subset(s, y>219)
> subset(s, z>315)
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

My problem is that I need to automate this, and a subset may turn out to be empty. In that case, I want to skip the step(s) that result in an empty set. For example, if my data were

> dat2 <- dat[1:50]
> s <- subset(dat2, x>119)
> s
   id        x        y        z
1:  3 119.4697 217.8550 313.9384
2: 50 119.2519 214.2517 318.8567

the second step, subset(s, y>219), would come up empty, but I would still want to apply the third step, subset(s, z>315). Is there a way to apply a subset command only if it results in a non-empty set? I imagine something like subset(s, y>219, nonzero=TRUE). I would like to avoid constructions like

s <- dat
if(nrow(subset(s, x>119))>0){s <- subset(s, x>119)}
if(nrow(subset(s, y>219))>0){s <- subset(s, y>219)}
if(nrow(subset(s, z>315))>0){s <- subset(s, z>315)}

because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.). That is why I am hoping to find a solution optimized for speed.
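Part of the cost of the construction above is that every condition is evaluated twice, once for the nrow(.) test and once for the assignment. A minimal sketch of evaluating each subset only once follows; keep_if_nonempty is a hypothetical helper (it is not part of base R or data.table), and the small table is hand-made for illustration rather than taken from the data above.

```r
library(data.table)

# Hypothetical helper: return `candidate` unless it is empty,
# in which case the step is skipped and `current` is kept.
keep_if_nonempty <- function(current, candidate) {
  if (nrow(candidate) > 0L) candidate else current
}

# Small hand-made table so the behaviour is easy to follow.
toy <- data.table(id = 1:4,
                  x  = c(100, 119.5, 119.2, 110),
                  y  = c(210, 217,   214,   219.5),
                  z  = c(301, 313,   318,   316))

s <- toy
s <- keep_if_nonempty(s, s[x > 119])  # ids 2 and 3 remain
s <- keep_if_nonempty(s, s[y > 219])  # would be empty, so this step is skipped
s <- keep_if_nonempty(s, s[z > 315])  # id 3 remains
```

Each condition is computed once here, but the conditions are still written out by hand, so this does not yet solve the automation part of the question.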

PS. I only chose subset(.) for clarity; solutions using e.g. data.table would be just as welcome, if not more so.

Recommended answer

I agree with Konrad's answer that this should throw a warning or at least report what happened somehow. Here's a data.table way that will take advantage of indices (see the package vignettes for details):

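f = function(x, ..., verbose=FALSE){
  # capture the conditions unevaluated
  L   = substitute(list(...))[-1]
  # log table: one row per condition, flagging the ones that get skipped
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    # apply the i-th condition to the current table
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d                    # non-empty: keep the subset
    } else {
      mon[i, skip := TRUE]     # empty: skip this step and record it
    }
  }
  print(mon)
  return(x)
}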

Usage

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

The verbose option will print extra information provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see this reported in the output.
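To see what an index looks like outside of f, here is a small sketch using data.table's setindex() and indices() on a made-up table (DT and its values are assumptions for illustration). As far as I know, indices are only used for == and %in% conditions, so range conditions such as x > 119 from the question would probably not benefit from them.

```r
library(data.table)

# Made-up table for illustration only.
DT <- data.table(id = 1:5, x = c(3, 1, 2, 1, 3))

setindex(DT, x)      # build a secondary index on column x
idx <- indices(DT)   # names of the indices currently attached to DT

# With verbose = TRUE, data.table prints diagnostic messages,
# including whether an existing index was used for this subset.
res <- DT[x == 1, verbose = TRUE]
```

The exact wording of the verbose messages depends on the installed data.table version, so it is not reproduced here.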

Regarding "because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)":

If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x), to more easily keep track of what the query was and what happened. The verbose console output could also be captured and returned.
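A rough sketch of that variant, applied over a list with lapply(.), might look as follows. The name f2, the dropped verbose argument and the two small tables are assumptions made for this illustration; the subsetting logic is the same as in f.

```r
library(data.table)

# Hypothetical variant of f: returns the log together with the result
# instead of printing the log.
f2 <- function(x, ...) {
  L   <- substitute(list(...))[-1]                     # unevaluated conditions
  mon <- data.table(cond = as.character(L), skip = FALSE)
  for (i in seq_along(L)) {
    d <- eval(substitute(x[cond], list(cond = L[[i]])))
    if (nrow(d)) x <- d else mon[i, skip := TRUE]      # skip empty results
  }
  list(mon = mon, x = x)
}

# Two made-up tables standing in for the list of data.tables.
tables <- list(
  a = data.table(id = 1:3, x = c(1, 5, 6), y = c(10, 20, 30)),
  b = data.table(id = 1:3, x = c(0, 1, 2), y = c(7, 8, 9))
)

# The conditions are written inside an anonymous function so that
# substitute() sees them as typed.
out <- lapply(tables, function(tb) f2(tb, x > 4, y > 25, y > 1e6))
```

Each element of out then holds both the filtered table (out$a$x) and the record of which conditions were skipped (out$a$mon), which can be inspected or combined afterwards.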
