如何筛选具有多个条件的行 [英] How to filter rows with multiple conditions
问题描述
我是R的新手.我正在尝试根据多种条件从data.frame(df)过滤行:
I am new to R. I'm trying to filter rows from a data.frame (df) based on multiple conditions:
我的data.frame的示例:我的df图片
An example of my data.frame: image of my df
df:
SNPA SNPB value block1 block2 score_T
A1 A22 0.379927 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 12
A2 A23 0.449074 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 25
A3 A24 0.464135 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 584
A4 A22 0.328866 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 51
A5 A22 0.326026 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 64
A22 A27 0.57169 A22|A23|A24|A25 A27|A28|A29|A30|A31 77
A23 A28 0.416178 A22|A23|A24|A25 A27|A28|A29|A30|A31 90
A24 A29 0.456144 A22|A23|A24|A25 A27|A28|A29|A30|A31 103
A34 A39 0.379927 A31|A32|A33|A34 A39|A40|A41|A42 116
A34 A40 0.759074 A31|A32|A33|A34 A39|A40|A41|A42 129
A34 A41 0.562303 A31|A32|A33|A34 A39|A40|A41|A42 142
A39 A57 0.322303 A39|A40|A41|A42 A52|A53|A54|A55|A56|A57|A58|A59|A60|A61 25
A40 A57 0.372303 A39|A40|A41|A42 A52|A53|A54|A55|A56|A57|A58|A59|A60|A61 198
A41 A57 0.562303 A39|A40|A41|A42 A52|A53|A54|A55|A56|A57|A58|A59|A60|A61 356
我想要的是使用 dplyr
来仅保留块(block1和block2)至少具有两个SNP的行(分别来自block的SNPA列和block2的SNPB列),并删除包含1个SNP的成对块(例如:第9到14行).
What I want is to use dplyr
to keep only the rows where the blocks (block1 and block2) have at least two SNPs (from the column SNPA for block1, and SNPB for block2) each, and remove pairs of blocks containing 1 SNP (example: rows 9 to 14).
想要的结果:结果
SNPA SNPB value block1 block2 score_T
A1 A22 0.379927 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 12
A2 A23 0.449074 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 25
A3 A24 0.464135 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 584
A4 A22 0.328866 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 51
A5 A22 0.326026 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 64
A22 A27 0.57169 A22|A23|A24|A25 A27|A28|A29|A30|A31 77
A23 A28 0.416178 A22|A23|A24|A25 A27|A28|A29|A30|A31 90
A24 A29 0.456144 A22|A23|A24|A25 A27|A28|A29|A30|A31 103
您知道我该怎么做吗?
result <- df %>% group_by(block1, block2) %>% filter(...) %>% summarise(mean_s = mean(score_T), number = n())
谢谢.
推荐答案
一个有点慢的 base-dplyr
解决方案.该解决方案的一些问题包括需要在我们的过滤器功能中手动设置块"和快照".可能可以使这一过程自动化.
A somewhat slow base-dplyr
solution. Some issues with this solution include the need to manually set "blocks" and "snps" in our filter function. One could probably automate this process.
my_filter <- function(df,block, snp){
res<-strsplit(df[[block]],"|", fixed= TRUE)
lengths(lapply(res, function(x) which(x %in% df[[snp]]))) > 1
}
df %>%
filter(my_filter(., "block1", "SNPA"), my_filter(., "block2","SNPB"))
SNPA SNPB value block1 block2 score_T
1 A1 A22 0.379927 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 12
2 A2 A23 0.449074 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 25
3 A3 A24 0.464135 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 584
4 A4 A22 0.328866 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 51
5 A5 A22 0.326026 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 64
6 A22 A27 0.571690 A22|A23|A24|A25 A27|A28|A29|A30|A31 77
7 A23 A28 0.416178 A22|A23|A24|A25 A27|A28|A29|A30|A31 90
8 A24 A29 0.456144 A22|A23|A24|A25 A27|A28|A29|A30|A31 103
数据:
df <-structure(list(SNPA = c("A1", "A2", "A3", "A4", "A5", "A22",
"A23", "A24", "A34", "A34", "A34", "A39", "A40", "A41"), SNPB = c("A22",
"A23", "A24", "A22", "A22", "A27", "A28", "A29", "A39", "A40",
"A41", "A57", "A57", "A57"), value = c(0.379927, 0.449074, 0.464135,
0.328866, 0.326026, 0.57169, 0.416178, 0.456144, 0.379927, 0.759074,
0.562303, 0.322303, 0.372303, 0.562303), block1 = c("A1|A2|A3|A4|A5|A6",
"A1|A2|A3|A4|A5|A6", "A1|A2|A3|A4|A5|A6", "A1|A2|A3|A4|A5|A6",
"A1|A2|A3|A4|A5|A6", "A22|A23|A24|A25", "A22|A23|A24|A25", "A22|A23|A24|A25",
"A31|A32|A33|A34", "A31|A32|A33|A34", "A31|A32|A33|A34", "A39|A40|A41|A42",
"A39|A40|A41|A42", "A39|A40|A41|A42"), block2 = c("A22|A23|A24|A25",
"A22|A23|A24|A25", "A22|A23|A24|A25", "A22|A23|A24|A25", "A22|A23|A24|A25",
"A27|A28|A29|A30|A31", "A27|A28|A29|A30|A31", "A27|A28|A29|A30|A31",
"A39|A40|A41|A42", "A39|A40|A41|A42", "A39|A40|A41|A42", "A52|A53|A54|A55|A56|A57|A58|A59|A60|A61",
"A52|A53|A54|A55|A56|A57|A58|A59|A60|A61", "A52|A53|A54|A55|A56|A57|A58|A59|A60|A61"
), score_T = c(12L, 25L, 584L, 51L, 64L, 77L, 90L, 103L, 116L,
129L, 142L, 25L, 198L, 356L)), class = "data.frame", row.names = c(NA,
-14L))
这篇关于如何筛选具有多个条件的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!