Subsetting by multi-column index/key in dplyr (have data.table soln)

Views: 140  Published: 2017/3/12 11:52:38  Tags: r, data.table, dplyr


Problem description

I'm looking to find a way to subset (or rethink how I handle the task) the following situation to stay in dplyr rather than "resort" to data.table as much of my analysis before/after this chunk is done in dplyr.

Situation: given a simulated dataset with multiple replications I would like to subset/dplyr::filter based on a two column key (ID and REP).

libs <- c("dplyr", "data.table")
lapply(libs, require, character.only = TRUE)

# minimally reproducible example

# dataset
dat <- expand.grid(ID = 1:3, REP = 1:5, TIME = 1:3)
dat <- dat[order(dat$REP, dat$ID, dat$TIME),]
dat$CONC <- runif(nrow(dat), 1, 10)

# key/index
set.seed(1235)
ID_sample <- sample(unique(dat$ID), size = 5, replace = TRUE)
REP_sample <- sample(unique(dat$REP), size = 5, replace = TRUE)
key <- data.frame(ID = ID_sample, REP = REP_sample)


# data table solution
dt <- data.table(dat)
setkey(dt, ID, REP)
dt_subset <- dt[J(key)]

The data.table solution results in the following:

initial data structure:

   ID REP TIME     CONC
1   1   1    1 1.310819
2   1   1    2 2.371361
3   1   1    3 7.621165
4   2   1    1 1.010229
5   2   1    2 4.520830
6   2   1    3 5.162452
...
40  2   5    1 6.629885
41  2   5    2 9.680233
42  2   5    3 8.445726
43  3   5    1 3.835254
44  3   5    2 2.917229
45  3   5    3 7.592465

generated key and resulting subset:

> key
  ID REP
1  1   3
2  2   3
3  1   4
4  3   3
5  3   2

> dt[J(key)]
    ID REP TIME     CONC
 1:  1   3    1 3.038205
 2:  1   3    2 5.361020
 3:  1   3    3 8.137065
 4:  2   3    1 1.053889
 5:  2   3    2 2.689412
 6:  2   3    3 7.136503
 7:  1   4    1 9.137392
 8:  1   4    2 6.556821
 9:  1   4    3 2.206285
10:  3   3    1 4.330937
11:  3   3    2 4.254630
12:  3   3    3 8.819154
13:  3   2    1 4.508456
14:  3   2    2 7.286893
15:  3   2    3 5.896521

Is there a way of using this multi-column index to filter in dplyr?

The only 'solution' I've thought of so far is to create a new column like so:

dat <- transform(dat, ID_REP = paste0(ID, '_', REP))
KEY <- paste0(ID_sample, '_', REP_sample)
filter(dat, ID_REP %in% KEY)

which works:

   ID REP TIME     CONC ID_REP
1   3   2    1 4.029622    3_2
2   3   2    2 5.786582    3_2
3   3   2    3 2.846836    3_2
4   1   3    1 4.968823    1_3
5   1   3    2 6.940782    1_3
6   1   3    3 5.017697    1_3
7   2   3    1 7.571442    2_3
8   2   3    2 6.350095    2_3
9   2   3    3 3.924427    2_3
10  3   3    1 6.360991    3_3
11  3   3    2 3.273693    3_3
12  3   3    3 4.029781    3_3
13  1   4    1 6.617855    1_4
14  1   4    2 1.910202    1_4
15  1   4    3 5.496817    1_4

but is inelegant and does not provide an easily extensible solution.

Solution

You're looking for a semi join:

semi_join(dat, key)
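
To make the behaviour concrete, here is a minimal sketch that compares the semi join with the pasted-key workaround from the question. It assumes the key pairs printed in the question (typed out by hand rather than sampled), sets the seed before `runif()` so `CONC` is reproducible, and passes `by` explicitly instead of relying on the default of joining on all shared column names. The names `sub_semi` and `sub_paste` are illustrative only.

```r
library(dplyr)

# Same structure as the question's example; the seed is set first
# so that CONC is reproducible (an assumption, not in the original).
set.seed(1235)
dat <- expand.grid(ID = 1:3, REP = 1:5, TIME = 1:3)
dat <- dat[order(dat$REP, dat$ID, dat$TIME), ]
dat$CONC <- runif(nrow(dat), 1, 10)

# Key written out by hand to match the pairs printed in the question.
key <- data.frame(ID  = c(1L, 2L, 1L, 3L, 3L),
                  REP = c(3L, 3L, 4L, 3L, 2L))

# Keep the rows of dat whose (ID, REP) pair appears in key.
sub_semi <- semi_join(dat, key, by = c("ID", "REP"))

# The pasted-key workaround from the question, for comparison.
sub_paste <- dat[paste0(dat$ID, "_", dat$REP) %in%
                   paste0(key$ID, "_", key$REP), ]

nrow(sub_semi)  # 15: five distinct key pairs times three TIME values
```

Two differences from `dt[J(key)]` are worth keeping in mind. A semi join returns each matching row of `dat` at most once and in the order of `dat`, whereas the data.table join follows the order of `key` and repeats rows when `key` contains duplicate pairs. If the repeated rows are wanted, `inner_join(key, dat, by = c("ID", "REP"))` is likely the closer equivalent.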
