Group-wise conditional subsetting where feasible


Question

I would like to subset the rows of my data

library(data.table); set.seed(333); n <- 100 
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))

> head(dat)
   id group        x        y        z
1:  1     1 109.3400 208.6732 308.7595
2:  2     1 101.6920 201.0989 310.1080
3:  3     1 119.4697 217.8550 313.9384
4:  4     1 111.4261 205.2945 317.3651
5:  5     1 100.4024 212.2826 305.1375
6:  6     1 114.4711 203.6988 319.4913

in several stages, unless a stage results in an empty subset. In that case, I would like to skip that specific subsetting step. In an earlier question, Frank found a great solution for this:

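f = function(x, ..., verbose=FALSE){
  # capture the conditions unevaluated; [-1] drops the leading `list` symbol
  L   = substitute(list(...))[-1]
  # monitoring table: one row per condition, flagged when it is skipped
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    # apply the i-th condition to the current subset
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d                   # non-empty: keep the reduced subset
    } else {
      mon[i, skip := TRUE]    # empty: skip this condition
    }
  }
  print(mon)
  return(x)
}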

where I can enter the data and the cut-offs for each variable manually.

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id  group         x        y        z
1: 55      2  119.2634 219.0044 315.6556
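
To make the skip-if-empty behaviour of f() concrete, here is a minimal sketch on toy data. The table `toy`, the helper `capture` and the loop are illustrative assumptions, not part of Frank's solution; only the `substitute(list(...))[-1]` idiom is taken from f().

```r
library(data.table)

# Toy data (hypothetical, not the question's `dat`)
toy <- data.table(id = 1:4, x = c(1, 5, 9, 9), y = c(2, 2, 8, 3))

# Capture unevaluated conditions the same way f() does
capture <- function(...) substitute(list(...))[-1]
conds <- capture(x > 4, y > 100, y > 2)

res <- toy
for (cond in as.list(conds)) {
  keep <- eval(cond, envir = res)   # logical vector for the current subset
  d <- res[keep]
  if (nrow(d) > 0) res <- d         # only accept the step if rows remain
}
res
```

Here `y > 100` would empty the subset, so it is skipped, while `x > 4` and `y > 2` are applied in order.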

I now wonder how this (or something even better!) could be applied to a case where the cut-offs are stored in a second data.table

c <- data.table(group=1:2, x=c(110,119), y=c(210,219), z=c(310,319))
> c
   group   x   y   z
1:     1 110 210 310
2:     2 119 219 319

and specified for each group separately.

If I were to use f(.), I thought of joining c into dat, but I can't figure out how. Perhaps there is an entirely smarter way.

Recommended answer

First, I would change how c is constructed. You currently have one column per filter, but a long format (one row per filter) would allow multiple filters on the same column, as in your initial example (i.e. two filters on y):

c <- data.table(group=c(1,2,1,2,1,2,1),variable = c("x","x","y","y","z","z","y"), c_val = c(110,119,210,219,310,319,1e6))
c[, c_id := 1:.N]
c
   group variable   c_val c_id
1:     1        x     110    1
2:     2        x     119    2
3:     1        y     210    3
4:     2        y     219    4
5:     1        z     310    5
6:     2        z     319    6
7:     1        y 1000000    7
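
If the cut-offs already exist in the wide layout from the question, the long table does not have to be typed by hand. The following is a sketch of one way to derive it with melt(); the names `cuts` and `cuts_long` are assumptions chosen here to avoid masking base::c.

```r
library(data.table)

# Wide cut-off table as given in the question (renamed from `c` to `cuts`)
cuts <- data.table(group = 1:2, x = c(110, 119), y = c(210, 219), z = c(310, 319))

# Reshape to one row per (group, variable) filter
cuts_long <- melt(cuts, id.vars = "group",
                  variable.name = "variable", value.name = "c_val",
                  variable.factor = FALSE)

# Append the extra filter on y for group 1 used in the example above
cuts_long <- rbind(cuts_long,
                   data.table(group = 1L, variable = "y", c_val = 1e6))
cuts_long[, c_id := seq_len(.N)]
cuts_long
```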

After melting dat to long format, you can merge the filters onto the data:

dat_cut <- melt(dat, id.vars = c("id", "group"), value.name = "dat_val")
output <- merge(dat_cut, c, by = c("group","variable"), allow.cartesian = TRUE)

The next line tests the filters. If you want richer filter logic (greater than, less than, equal to, etc.), you can extend this line and encode that logic in c:

output <- output[dat_val > c_val]
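
As one possible way to encode the comparison in the filter table, the sketch below adds an operator column. The column name `op`, the tables `dat_long` and `flt`, and all values are hypothetical and only illustrate the idea.

```r
library(data.table)

# Toy long-format data and filters (hypothetical values); `op` is an assumed extra column
dat_long <- data.table(id = c(1, 1, 2, 2), group = 1L,
                       variable = c("x", "y", "x", "y"),
                       dat_val = c(5, 50, 15, 5))
flt <- data.table(group = 1L, variable = c("x", "y"),
                  c_val = c(10, 20), op = c(">", "<"), c_id = 1:2)

out <- merge(dat_long, flt, by = c("group", "variable"))

# Evaluate each row's comparison according to its own operator
out[, pass := (op == ">"  & dat_val >  c_val) |
              (op == "<"  & dat_val <  c_val) |
              (op == "==" & dat_val == c_val)]
out <- out[pass == TRUE]
out
```

Only id 2 passes both filters here (x above 10 and y below 20).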

You then want to find every id for which the number of filters met equals the number of distinct filters that were met by at least one row in that group:

output[,req_match := uniqueN(c_id), by = .(group)] # number of filters where a condition was met.
selection <- output[,.N,by = .(id, group, req_match)][N == req_match, id]

If a filter did not match any rows in a group, it never appears in output, so it is not counted in req_match and is effectively excluded here.
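
The counting step can be checked on a small hand-made table. In this sketch `hits` stands in for output after the comparison; its values are hypothetical, and a third filter that matched no row is simply absent from it.

```r
library(data.table)

# Toy stand-in for `output` after the comparison (hypothetical values):
# a filter with c_id 3 matched no row, so it does not appear at all
hits <- data.table(id    = c(1, 1, 2, 3),
                   group = 1L,
                   c_id  = c(1, 2, 1, 2))

# Number of distinct filters that matched at least one row, per group
hits[, req_match := uniqueN(c_id), by = group]

# Keep the ids that satisfy every such filter
keep <- hits[, .N, by = .(id, group, req_match)][N == req_match, id]
keep
```

Only id 1 meets both filters that matched anything, so it is the only id kept.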

Then you can filter your initial dataset to get the solution:

dat[id %in% selection]

   id group        x        y        z
1:  3     1 119.4697 217.8550 313.9384
2: 18     1 117.2930 216.5670 310.4617
3: 35     1 110.4283 218.6130 312.0904
4: 50     1 119.2519 214.2517 318.8567
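
Note that the count-based approach drops a filter only when no row of the group passes it on its own, whereas f() applies each filter to whatever is left after the previous ones. If that sequential behaviour matters, one possible sketch is to apply the cut-offs in order within each group. The helper `seq_filter` and the toy tables below are hypothetical and are not part of the answer above.

```r
library(data.table)

# Hypothetical helper: apply cut-offs one after another,
# skipping any cut-off that would leave no rows
seq_filter <- function(d, cuts) {
  for (v in names(cuts)) {
    keep <- d[[v]] > cuts[[v]]
    if (any(keep)) d <- d[keep]
  }
  d
}

# Toy data and per-group cut-offs (hypothetical values)
toy  <- data.table(id = 1:6, group = rep(1:2, each = 3),
                   x = c(1, 5, 9, 2, 4, 6), y = c(9, 1, 8, 3, 3, 3))
cuts <- data.table(group = 1:2, x = c(4, 3), y = c(5, 10))

# Process each group separately with its own cut-offs, then recombine
parts <- split(toy, by = "group")
res <- rbindlist(lapply(parts, function(d) {
  g <- d$group[1]
  seq_filter(d, as.list(cuts[group == g, .(x, y)]))
}))
res
```

In group 2 the cut-off on y would remove every remaining row, so it is skipped and the rows that passed the x cut-off are kept.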
