R subsetting dataframe based on the combination of 3 columns and excluding duplicate combinations
Question
I have a dataset which looks like:
Experiment Sequence Parameter Time
Exp1 AAAA 2 10
Exp2 AAAA 2 11
Exp3 AAAA 2 12
Exp1 BBBB 2 13.1
Exp1 BBBB 3 13.2
Exp1 BBBB 4 13.3
Exp2 BBBB 2 14.1
Exp2 BBBB 3 14.2
Exp3 BBBB 2 16.3
Exp3 BBBB 3 16.4
Exp3 BBBB 4 16.5
Exp3 BBBB 5 16.6
Exp1 CCCC 2 20
Exp2 CCCC 2 22.2
Exp1 DDDD 3 22.3
Exp1 DDDD 2 22.4
Exp2 DDDD 3 25.2
Exp2 DDDD 2 25.3
Exp3 DDDD 3 27
Exp1 EEEE 2 28
Exp2 EEEE 3 29
Exp3 EEEE 4 30
Exp1 FFFF 2 33.2
Exp1 FFFF 3 33.4
Exp1 FFFF 4 33.6
Exp2 FFFF 2 35.1
Exp2 FFFF 3 35.2
Exp1 GGGG 2 40.1
Exp1 GGGG 2 40.2
Exp1 GGGG 2 40.3
Exp1 GGGG 2 42
Exp2 GGGG 2 42.3
Exp2 GGGG 2 44.3
Exp3 GGGG 2 45.3
Exp3 GGGG 2 45.4
The dataset has:
- many experiments
- many sequences, each of which can be present in one or more experiments
- 3-4 different parameters per sequence
For my analysis (on the time), I first need to subset the dataframe depending on the combination of the first 3 columns: Experiment, Sequence and Parameter. The rules are:
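- A Sequence + Parameter combination should be present in all experiments of the dataframe. If not, it should be excluded.
  In the example: BBBB + 4, BBBB + 5, CCCC + 2, ... should go; BBBB + 2, DDDD + 3, ... can stay.
- If a Sequence + Parameter combination is present more than once in the same experiment, that combination should also be excluded.
  In the example: GGGG + 2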
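One way to read the two rules together is that a Sequence + Parameter combination stays only if it has exactly one row in every experiment. The following is a minimal base R sketch of that check. It uses a small hypothetical data frame (`toy`, not the full dataset above), and it assumes a combination duplicated in any one experiment is dropped entirely.

```r
# Hypothetical toy data, smaller than the dataset in the question
toy <- data.frame(
  Experiment = c("Exp1", "Exp2", "Exp3", "Exp1", "Exp2", "Exp1", "Exp1", "Exp2", "Exp3"),
  Sequence   = c("AAAA", "AAAA", "AAAA", "CCCC", "CCCC", "GGGG", "GGGG", "GGGG", "GGGG"),
  Parameter  = c("2", "2", "2", "2", "2", "2", "2", "2", "2"),
  stringsAsFactors = FALSE
)

# Count rows per Experiment (columns) for each Sequence + Parameter combination (rows)
combo  <- paste(toy$Sequence, toy$Parameter, sep = "+")
counts <- table(combo, toy$Experiment)

# Rule 1: present in every experiment; rule 2: never more than once per experiment.
# Both hold exactly when every cell in the combination's row equals 1.
keep <- rownames(counts)[apply(counts == 1, 1, all)]
keep
```

In this toy data CCCC + 2 fails the first rule (missing from Exp3) and GGGG + 2 fails the second (twice in Exp1), so only AAAA + 2 remains in `keep`.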
So the dataframe in the example should become like this after subsetting:
Experiment Sequence Parameter Time
Exp1 AAAA 2 10
Exp2 AAAA 2 11
Exp3 AAAA 2 12
Exp1 BBBB 2 13.1
Exp2 BBBB 2 14.1
Exp3 BBBB 2 16.3
Exp1 BBBB 3 13.2
Exp2 BBBB 3 14.2
Exp3 BBBB 3 16.4
Exp1 DDDD 3 22.3
Exp2 DDDD 3 25.2
Exp3 DDDD 3 27
Can someone help me? Thank you!
Experiment <- c("Exp1", "Exp2", "Exp3", "Exp1", "Exp1", "Exp1", "Exp2", "Exp2", "Exp3", "Exp3", "Exp3", "Exp3", "Exp1", "Exp2", "Exp1", "Exp1", "Exp2", "Exp2", "Exp3", "Exp1", "Exp2", "Exp3", "Exp1", "Exp1", "Exp1", "Exp2", "Exp2", "Exp1", "Exp1", "Exp1", "Exp1", "Exp2", "Exp2", "Exp3", "Exp3")
Sequence <- c("AAAA", "AAAA", "AAAA", "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "BBBB", "BBBB","BBBB", "BBBB", "CCCC", "CCCC", "DDDD", "DDDD", "DDDD", "DDDD", "DDDD", "EEEE", "EEEE", "EEEE", "FFFF", "FFFF", "FFFF", "FFFF", "FFFF", "GGGG", "GGGG", "GGGG", "GGGG", "GGGG", "GGGG", "GGGG", "GGGG")
Parameter <- c("2", "2", "2", "2", "3", "4", "2", "3", "2", "3", "4", "5", "2", "2", "3", "2", "3", "2", "3", "2", "3", "4", "2", "3", "4", "2", "3", "2", "2", "2", "2", "2", "2", "2", "2")
Time <- c(10.0, 11.0, 12.0, 13.1, 13.2, 13.3, 14.1, 14.2, 16.3, 16.4, 16.5, 16.6, 20.0, 22.2, 22.3, 22.4, 25.2, 25.3, 27.0, 28.0, 29.0, 30.0, 33.2, 33.4, 33.6, 35.1, 35.2, 40.1, 40.2, 40.3, 42.0, 42.3, 44.3, 45.3, 45.4)
df <- data.frame(Experiment, Sequence, Parameter, Time)
Recommended answer
One option is data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)). Grouped by 'Sequence' and 'Parameter', if the number of unique elements in 'Experiment' is 3, we subset the data.table (.SD). Then, grouped by 'Experiment', 'Sequence', and 'Parameter', if the number of rows is equal to 1 (.N == 1), we subset the data.table (.SD) again.
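library(data.table)
# Step 1: keep Sequence + Parameter groups that occur in all 3 experiments
# Step 2: keep Experiment + Sequence + Parameter groups with exactly one row
setDT(df)[, if (uniqueN(Experiment) == 3) .SD, by = .(Sequence, Parameter)
  ][, if (.N == 1) .SD, by = .(Experiment, Sequence, Parameter)]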
# Experiment Sequence Parameter Time
# 1: Exp1 AAAA 2 10.0
# 2: Exp2 AAAA 2 11.0
# 3: Exp3 AAAA 2 12.0
# 4: Exp1 BBBB 2 13.1
# 5: Exp2 BBBB 2 14.1
# 6: Exp3 BBBB 2 16.3
# 7: Exp1 BBBB 3 13.2
# 8: Exp2 BBBB 3 14.2
# 9: Exp3 BBBB 3 16.4
#10: Exp1 DDDD 3 22.3
#11: Exp2 DDDD 3 25.2
#12: Exp3 DDDD 3 27.0
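The chain above hard-codes 3 as the number of experiments, and its second step filters per experiment. If a combination were duplicated in only some experiments, its rows from the other experiments would seem to survive. The following is a hedged variant, not part of the original answer. It assumes the intent is to drop such a combination entirely, and it uses a hypothetical toy table `dt` and helper variable `n_exp`.

```r
library(data.table)

# Hypothetical toy data: GGGG + 2 is duplicated only in Exp1
dt <- data.table(
  Experiment = c("Exp1", "Exp2", "Exp3", "Exp1", "Exp2", "Exp1", "Exp1", "Exp2", "Exp3"),
  Sequence   = c("AAAA", "AAAA", "AAAA", "CCCC", "CCCC", "GGGG", "GGGG", "GGGG", "GGGG"),
  Parameter  = c("2", "2", "2", "2", "2", "2", "2", "2", "2"),
  Time       = c(10, 11, 12, 20, 22.2, 40.1, 40.2, 42.3, 45.3)
)

# Number of experiments taken from the data instead of a literal 3
n_exp <- uniqueN(dt$Experiment)

# Keep a Sequence + Parameter group only if it has exactly one row per experiment:
# n_exp distinct experiments (rule 1) and n_exp rows in total (rule 2).
res <- dt[, if (uniqueN(Experiment) == n_exp && .N == n_exp) .SD,
          by = .(Sequence, Parameter)]
res
```

Here only the three AAAA rows remain. The grouping columns come first in the result, so `setcolorder(res, c("Experiment", "Sequence", "Parameter", "Time"))` can restore the original column order if needed.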