Subsetting by multi-column index/key in dplyr (have data.table soln)
Problem description
I'm looking for a way to subset the data below (or to rethink how I handle the task) that stays in dplyr rather than "resorting" to data.table, since much of my analysis before and after this chunk is done in dplyr.
Situation: given a simulated dataset with multiple replications, I would like to subset/dplyr::filter based on a two-column key (ID and REP).
    libs <- c("dplyr", "data.table")
    lapply(libs, require, character.only = TRUE)

    # minimally reproducible example
    # dataset
    dat <- expand.grid(ID = 1:3, REP = 1:5, TIME = 1:3)
    dat <- dat[order(dat$REP, dat$ID, dat$TIME), ]
    dat$CONC <- runif(nrow(dat), 1, 10)

    # key/index
    set.seed(1235)
    ID_sample <- sample(unique(dat$ID), size = 5, replace = TRUE)
    REP_sample <- sample(unique(dat$REP), size = 5, replace = TRUE)
    key <- data.frame(ID = ID_sample, REP = REP_sample)

    # data table solution
    dt <- data.table(dat)
    setkey(dt, ID, REP)
    dt_subset <- dt[J(key)]
The data.table solution results in the following:
initial data structure:
       ID REP TIME     CONC
    1   1   1    1 1.310819
    2   1   1    2 2.371361
    3   1   1    3 7.621165
    4   2   1    1 1.010229
    5   2   1    2 4.520830
    6   2   1    3 5.162452
    ...
    40  2   5    1 6.629885
    41  2   5    2 9.680233
    42  2   5    3 8.445726
    43  3   5    1 3.835254
    44  3   5    2 2.917229
    45  3   5    3 7.592465
generated key and resulting subset:
    > key
      ID REP
    1  1   3
    2  2   3
    3  1   4
    4  3   3
    5  3   2

    > dt[J(key)]
        ID REP TIME     CONC
     1:  1   3    1 3.038205
     2:  1   3    2 5.361020
     3:  1   3    3 8.137065
     4:  2   3    1 1.053889
     5:  2   3    2 2.689412
     6:  2   3    3 7.136503
     7:  1   4    1 9.137392
     8:  1   4    2 6.556821
     9:  1   4    3 2.206285
    10:  3   3    1 4.330937
    11:  3   3    2 4.254630
    12:  3   3    3 8.819154
    13:  3   2    1 4.508456
    14:  3   2    2 7.286893
    15:  3   2    3 5.896521
Is there a way of using this multi-column index to filter in dplyr?
The only 'solution' I've thought of so far is to create a new column like so:
    dat <- transform(dat, ID_REP = paste0(ID, '_', REP))
    KEY <- paste0(ID_sample, '_', REP_sample)
    filter(dat, ID_REP %in% KEY)
which works:
       ID REP TIME     CONC ID_REP
    1   3   2    1 4.029622    3_2
    2   3   2    2 5.786582    3_2
    3   3   2    3 2.846836    3_2
    4   1   3    1 4.968823    1_3
    5   1   3    2 6.940782    1_3
    6   1   3    3 5.017697    1_3
    7   2   3    1 7.571442    2_3
    8   2   3    2 6.350095    2_3
    9   2   3    3 3.924427    2_3
    10  3   3    1 6.360991    3_3
    11  3   3    2 3.273693    3_3
    12  3   3    3 4.029781    3_3
    13  1   4    1 6.617855    1_4
    14  1   4    2 1.910202    1_4
    15  1   4    3 5.496817    1_4
but is inelegant and does not provide an easily extensible solution.
Solution
You're looking for a semi join:

    semi_join(dat, key)
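For anyone who wants to try this end to end, here is a minimal sketch of the semi join on the question's data. Two things in it are my own assumptions rather than part of the original post: the key is typed in from the printed output instead of being re-sampled, and the join columns are named explicitly through `by`, which also avoids dplyr's "Joining, by = ..." message. The CONC values will differ from those shown in the question, because the seed is set before `runif()` here.

```r
library(dplyr)

# Rebuild the question's dataset (seed set first here, so CONC is reproducible;
# the original post calls set.seed() only after runif())
set.seed(1235)
dat <- expand.grid(ID = 1:3, REP = 1:5, TIME = 1:3)
dat <- dat[order(dat$REP, dat$ID, dat$TIME), ]
dat$CONC <- runif(nrow(dat), 1, 10)

# Key copied from the printed output in the question (assumption: not re-sampled)
key <- data.frame(ID  = c(1L, 2L, 1L, 3L, 3L),
                  REP = c(3L, 3L, 4L, 3L, 2L))

# Keep the rows of dat whose (ID, REP) pair appears in key
subset_dplyr <- semi_join(dat, key, by = c("ID", "REP"))

# Cross-check against the pasted-key workaround from the question
expected <- dat[paste0(dat$ID, "_", dat$REP) %in% paste0(key$ID, "_", key$REP), ]
stopifnot(nrow(subset_dplyr) == nrow(expected),
          all(subset_dplyr$ID == expected$ID),
          all(subset_dplyr$REP == expected$REP))
```

Note that the result keeps the row order of `dat`, as the pasted-key filter does, rather than the order of `key` as in the `dt[J(key)]` output. Also, since the key is sampled with `replace = TRUE`, it may contain a repeated (ID, REP) pair. As far as I know, `semi_join()` would still return each matching row of `dat` only once, whereas the keyed data.table join returns the rows once per occurrence in the key. If you need the repeated rows, an `inner_join(key, dat, by = c("ID", "REP"))` is probably closer to the data.table behaviour.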