Group-wise subsetting where feasible


Problem description

I would like to subset rows of my data

library(data.table); set.seed(333); n <- 100 
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))

> head(dat)
   id group        x        y        z
1:  1     1 109.3400 208.6732 308.7595
2:  2     1 101.6920 201.0989 310.1080
3:  3     1 119.4697 217.8550 313.9384
4:  4     1 111.4261 205.2945 317.3651
5:  5     1 100.4024 212.2826 305.1375
6:  6     1 114.4711 203.6988 319.4913

in several stages within each group. I need to automate this, and a stage may return an empty subset. For example, focusing only on group 1,

dat1 <- dat[1:50]
> s <-subset(dat1,x>119)
> s
   id group        x        y        z
1:  3     1 119.4697 217.8550 313.9384
2: 50     1 119.2519 214.2517 318.8567

The second step, subset(s, y>219), would come up empty, but I would still want to apply the third step, subset(s, z>315). If I were to set the thresholds manually, Frank has provided an excellent solution here that outputs

> f(dat1, x>119, y>219, z>315)
      cond  skip
1: x > 119 FALSE
2: y > 219  TRUE
3: z > 315 FALSE
   id group        x        y        z
1: 50     1 119.2519 214.2517 318.8567

and reports which parts were skipped.
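
The skip-if-empty behaviour described above can be sketched as follows. This is only a minimal illustration and not Frank's original `f`: the helper name `seq_subset` and the three-row `toy` table (values copied from rows shown above) are assumptions made for the example.

```r
library(data.table)

# Hypothetical helper (not Frank's f): apply the conditions one after another,
# skipping any condition that would leave no rows.
seq_subset <- function(d, ...) {
  conds <- as.list(substitute(list(...)))[-1L]   # unevaluated conditions
  skipped <- logical(length(conds))
  for (i in seq_along(conds)) {
    # evaluate the condition against the rows that survived so far
    keep <- eval(conds[[i]], d, parent.frame())
    keep <- !is.na(keep) & keep
    if (any(keep)) {
      d <- d[keep]
    } else {
      skipped[i] <- TRUE                         # empty result: skip this step
    }
  }
  report <- data.table(
    cond = vapply(conds, function(e) paste(deparse(e), collapse = " "), character(1)),
    skip = skipped
  )
  list(result = d, report = report)
}

# Toy data: ids 3, 4 and 50 with the values shown in the question
toy <- data.table(
  id = c(3, 4, 50),
  x  = c(119.4697, 111.4261, 119.2519),
  y  = c(217.8550, 205.2945, 214.2517),
  z  = c(313.9384, 317.3651, 318.8567)
)

out <- seq_subset(toy, x > 119, y > 219, z > 315)
print(out$report)
print(out$result)
```

Each condition is evaluated against the rows that survived the previous steps, so a skipped step leaves the current subset unchanged.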

My problem is that I need to apply this to different groups simultaneously, where the thresholds for each group are given in a separate data.table. The goal is to have at least one id per group. For example, if my thresholds were

c <- data.table(group=1:2, x=c(119,119), y=c(219,219), z=c(315,319))
> c
   group   x   y   z
1:     1 119 219 315
2:     2 119 219 319

I would like to end up with

> res
   id group        x        y        z
1: 50     1 119.2519 214.2517 318.8567
2: 55     2 119.2634 219.0044 315.6556

I could apply Frank's function repeatedly within a for-loop but I am sure there are cleverer ways that save time. I wonder, for instance, whether the function can be applied to each group within data.table. Or perhaps there is a way within the tidyverse, which I am not really familiar with yet.

Solution

Another possible approach using standard evaluation:

#convert conditions into long format, storing operator in data.table as well
cond <- data.table(group=1:2, bop=c(`>`, `>`), x=c(119,119), y=c(219,219), z=c(315,319))
thres <- melt(cond, id.vars=c("group","bop"))

#convert data into long format and lookup filter and thresholds
mdat <- melt(dat, id.vars=c("id", "group"))[
    thres, on=.(group, variable), c("bop","thres") := mget(c("bop","i.value"))]

#apply filtering
ss <- mdat[mapply(function(f, x, y) f(x, y), bop, value, thres)]

#apply sequential subsetting
dat[id %in% ss[, {
        idx <- id
        ans <- .SD[, {
                x <- intersect(idx, id)
                if(length(x) > 0) {
                    idx <- x
                }
                idx
            }, .(variable)]

        ans[variable==last(variable), V1]
    }, .(group)]$V1
]

output:

   id group        x        y        z
1: 50     1 119.2519 214.2517 318.8567
2: 55     2 119.2634 219.0044 315.6556
3: 58     2 119.2211 214.0305 319.3097
4: 72     2 114.0802 217.7402 313.3655
5: 90     2 116.8115 215.1576 317.0261
6: 99     2 119.2964 212.9973 308.9360
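
The long-format lookup used in the code above can be inspected in isolation. The sketch below assumes a four-row `toy` table (values copied from rows shown in this post) and the column names `thr` and `pass`, which are illustrative and do not appear in the answer. It hard-codes the `>` comparison instead of storing the operator in a column.

```r
library(data.table)

# Toy data: two ids per group, values taken from rows shown in the post
toy <- data.table(
  id    = c(3L, 50L, 55L, 58L),
  group = c(1L, 1L, 2L, 2L),
  x = c(119.4697, 119.2519, 119.2634, 119.2211),
  y = c(217.8550, 214.2517, 219.0044, 214.0305),
  z = c(313.9384, 318.8567, 315.6556, 319.3097)
)
thresholds <- data.table(group = 1:2, x = c(119, 119), y = c(219, 219), z = c(315, 319))

# One row per (group, variable) threshold
long_thr <- melt(thresholds, id.vars = "group")

# One row per (id, variable) measurement
mdat <- melt(toy, id.vars = c("id", "group"))

# Update join: attach the matching threshold to every measurement
mdat[long_thr, thr := i.value, on = .(group, variable)]

# Flag which measurements pass their group's threshold
mdat[, pass := value > thr]
print(mdat)
```

After the join, every measurement sits next to the threshold for its group and variable, which is what lets a single vectorised comparison replace one `subset` call per column.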

data:

library(data.table)
set.seed(333)
n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2),
    x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
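
As a cross-check on the intended result, the sequential logic from the question can also be written per group in a more direct way. This is a hedged sketch, not part of the answer above: the helper `filter_steps`, the `toy` table, and the name `thresholds` are assumptions for illustration, and only the `>` comparison is supported.

```r
library(data.table)

# Hypothetical helper: for each threshold column in turn, keep rows with
# col > threshold, but skip the step if it would leave no rows.
filter_steps <- function(d, thr) {
  for (v in names(thr)) {
    keep <- d[[v]] > thr[[v]]
    keep <- !is.na(keep) & keep
    if (any(keep)) d <- d[keep]
  }
  d
}

# Toy data: two ids per group, values taken from rows shown in the post
toy <- data.table(
  id    = c(3, 50, 55, 58),
  group = c(1L, 1L, 2L, 2L),
  x = c(119.4697, 119.2519, 119.2634, 119.2211),
  y = c(217.8550, 214.2517, 219.0044, 214.0305),
  z = c(313.9384, 318.8567, 315.6556, 319.3097)
)
thresholds <- data.table(group = 1:2, x = c(119, 119), y = c(219, 219), z = c(315, 319))

# Split by group, look up that group's thresholds, filter, and recombine
res <- rbindlist(lapply(split(toy, by = "group"), function(d) {
  g <- d$group[1]
  thr <- as.list(thresholds[group == g, .(x, y, z)])
  filter_steps(d, thr)
}))
print(res)
```

On this toy input the sketch keeps id 50 for group 1 and id 55 for group 2. It still iterates over groups through `lapply`, so it is meant to make the intended behaviour explicit rather than to be faster than a for-loop.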
