Group-wise subsetting where feasible

Views: 65  Published: 2020/10/15 21:09:32  Tags: r data.table subset tidyverse

Problem description

I would like to subset rows of my data

library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))

> head(dat)
   id group        x        y        z
1:  1     1 109.3400 208.6732 308.7595
2:  2     1 101.6920 201.0989 310.1080
3:  3     1 119.4697 217.8550 313.9384
4:  4     1 111.4261 205.2945 317.3651
5:  5     1 100.4024 212.2826 305.1375
6:  6     1 114.4711 203.6988 319.4913

in several stages within each group. I need to automate this, and a subset may turn out to be empty at some stage. For example, focusing only on group 1,

> dat1 <- dat[1:50]
> s <- subset(dat1, x>119)
> s
   id group        x        y        z
1:  3     1 119.4697 217.8550 313.9384
2: 50     1 119.2519 214.2517 318.8567

The second step, subset(s, y>219), would come up empty, but I would still want to apply the third step, subset(s, z>315). If I were to set the thresholds manually, Frank has provided an excellent solution here that outputs

> f(dat1, x>119, y>219, z>315)
      cond  skip
1: x > 119 FALSE
2: y > 219  TRUE
3: z > 315 FALSE
   id group        x        y        z
1: 50     1 119.2519 214.2517 318.8567
and reports which parts were skipped.
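
The skip-if-empty behaviour described above can be sketched for a single group as follows. This is not Frank's f, whose code is not reproduced here: subset_where_feasible and the toy data frame are hypothetical names made up for illustration, and only base R is used.

```r
# Hypothetical helper (NOT Frank's f): apply the conditions in order and
# skip any condition that would leave no rows.
subset_where_feasible <- function(d, ...) {
  # Capture the conditions unevaluated, e.g. x > 4, y > 10, z > 5
  conds <- as.list(substitute(list(...)))[-1]
  skipped <- logical(length(conds))
  for (i in seq_along(conds)) {
    # Evaluate the condition using the columns of the current subset
    keep <- eval(conds[[i]], d, parent.frame())
    keep <- !is.na(keep) & keep
    if (any(keep)) {
      d <- d[keep, , drop = FALSE]
    } else {
      skipped[i] <- TRUE  # empty result: keep the previous subset instead
    }
  }
  cond_text <- vapply(conds, function(e) paste(deparse(e), collapse = " "),
                      character(1))
  list(result = d, log = data.frame(cond = cond_text, skip = skipped))
}

# Toy data, invented for this example
toy <- data.frame(id = 1:4, x = c(1, 5, 6, 7), y = c(0, 1, 2, 3),
                  z = c(9, 2, 8, 3))
out <- subset_where_feasible(toy, x > 4, y > 10, z > 5)
out$log
out$result
```

In the toy data no row has y > 10, so that step is skipped and z > 5 is applied to the rows that survived x > 4, which leaves only id 3.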

My problem is that I need to apply this to different groups simultaneously, where the thresholds for each group are given in a separate data.table. The goal is to have at least one id per group. For example, if my thresholds were 
c <- data.table(group=1:2, x=c(119,119), y=c(219,219), z=c(315,319))
> c
   group   x   y   z
1:     1 119 219 315
2:     2 119 219 319
I would like to end up with
> res
   id group        x        y        z
1: 50     1 119.2519 214.2517 318.8567
2: 55     2 119.2634 219.0044 315.6556
I could apply Frank's function repeatedly within a for-loop but I am sure there are cleverer ways that save time. I wonder, for instance, whether the function can be applied to each group within data.table. Or perhaps there is a way within the tidyverse, which I am not really familiar with yet. 
Solution
Another possible approach using standard evaluation:
#convert conditions into long format, storing operator in data.table as well
cond <- data.table(group=1:2, bop=c(`>`, `>`), x=c(119,119), y=c(219,219), z=c(315,319))
thres <- melt(cond, id.vars=c("group","bop"))

#convert data into long format and lookup filter and thresholds
mdat <- melt(dat, id.vars=c("id", "group"))[
    thres, on=.(group, variable), c("bop","thres") := mget(c("bop","i.value"))]

#apply filtering
ss <- mdat[mapply(function(f, x, y) f(x, y), bop, value, thres)]

#apply sequential subsetting
dat[id %in% ss[, {
        idx <- id
        ans <- .SD[, {
                x <- intersect(idx, id)
                if(length(x) > 0) {
                    idx <- x
                }
                idx
            }, .(variable)]

        ans[variable==last(variable), V1]
    }, .(group)]$V1
]
output:
   id group        x        y        z
1: 50     1 119.2519 214.2517 318.8567
2: 55     2 119.2634 219.0044 315.6556
3: 58     2 119.2211 214.0305 319.3097
4: 72     2 114.0802 217.7402 313.3655
5: 90     2 116.8115 215.1576 317.0261
6: 99     2 119.2964 212.9973 308.9360
data:
library(data.table)
set.seed(333)
n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2),
    x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
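
For comparison, the per-group loop mentioned in the question can be written fairly compactly. The following is a minimal sketch under stated assumptions, not the method of the answer above: filter_group, toy and thr are hypothetical names, every condition is assumed to be a strict ">", and base R data frames are used in place of data.table.

```r
# Hypothetical helper: apply "column > threshold" filters in the order given
# by vars, skipping a filter whenever it would leave no rows.
filter_group <- function(d, thr, vars) {
  for (v in vars) {
    keep <- which(d[[v]] > thr[[v]])  # which() also drops NA comparisons
    if (length(keep) > 0) d <- d[keep, , drop = FALSE]
  }
  d
}

# Toy data and per-group thresholds, invented for this example
toy <- data.frame(id = 1:6, group = rep(1:2, each = 3),
                  x = c(5, 6, 1, 7, 8, 2), y = c(1, 1, 9, 3, 9, 9))
thr <- data.frame(group = 1:2, x = c(4, 4), y = c(5, 5))

# Split by group, look up that group's threshold row, filter, and recombine
res <- do.call(rbind, lapply(split(toy, toy$group), function(d) {
  filter_group(d, thr[thr$group == d$group[1], ], c("x", "y"))
}))
rownames(res) <- NULL
res
```

In group 1 the y filter would remove every remaining row, so it is skipped and ids 1 and 2 are kept. In group 2 both filters apply and only id 5 remains. Each group therefore keeps at least one id whenever it has any rows to begin with.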


                        