Extracting items from an R data frame using criteria given as a (column_name = value) list
I would like to extract items from a column in a data frame based on criteria pertaining to values in other columns. These criteria are given in the form of a list associating column names with values. The ultimate goal is to use those items to select columns by name in another data structure.
Here is an example data frame:
> experimental_plan
lib genotype treatment replicate
1 A WT normal 1
2 B WT hot 1
3 C mut normal 1
4 D mut hot 1
5 E WT normal 2
6 F WT hot 2
7 G mut normal 2
8 H mut hot 2
And my selection criteria are encoded as the following list:
> ref_condition = list(genotype="WT", treatment="normal")
I want to extract the items in the "lib" column for the rows that match ref_condition, that is, "A" and "E".
1) I can get the columns to use for selection by calling names on my list of selection criteria:
> experimental_plan[, names(ref_condition)]
genotype treatment
1 WT normal
2 WT hot
3 mut normal
4 mut hot
5 WT normal
6 WT hot
7 mut normal
8 mut hot
2) I can test whether the resulting lines match my selection criteria:
> experimental_plan[, names(ref_condition)] == ref_condition
genotype treatment
[1,] TRUE TRUE
[2,] TRUE FALSE
[3,] FALSE TRUE
[4,] FALSE FALSE
[5,] TRUE TRUE
[6,] TRUE FALSE
[7,] FALSE TRUE
[8,] FALSE FALSE
> selection_vector <- apply(experimental_plan[, names(ref_condition)] == ref_condition, 1, all)
> selection_vector
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
(I think this step, with the apply, is not particularly elegant. There must be a better way.)
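One possible base R alternative to the apply step is to build one logical vector per criterion with Map and then combine them with Reduce. This is only a sketch: the data.frame call below is an assumed reconstruction of experimental_plan (with character rather than factor columns), not the original construction code.

```r
# Assumed reconstruction of the example data frame from the question.
experimental_plan <- data.frame(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = rep(1:2, each = 4),
  stringsAsFactors = FALSE
)
ref_condition <- list(genotype = "WT", treatment = "normal")

# One logical vector per (column_name = value) pair.
per_column <- Map(
  function(column, value) experimental_plan[[column]] == value,
  names(ref_condition),
  ref_condition
)

# A row is selected only if every criterion holds for it.
selection_vector <- Reduce(`&`, per_column)

selected_libs <- experimental_plan$lib[selection_vector]
print(selected_libs)
```

Because each criterion is looked up by column name, this does not rely on the list and the selected columns being in the same order.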
3) This boolean vector can be used to select the relevant lines:
> selected_lines <- experimental_plan[selection_vector,]
> selected_lines
lib genotype treatment replicate
1 A WT normal 1
5 E WT normal 2
4) From this point on, I know how to use dplyr to select the items I'm interested in:
> lib1 <- filter(selected_lines, replicate=="1") %>% select(lib) %>% unlist()
> lib2 <- filter(selected_lines, replicate=="2") %>% select(lib) %>% unlist()
> lib1
lib
A
Levels: A B C D E F G H
> lib2
lib
E
Levels: A B C D E F G H
Can dplyr (or other clever techniques) be used in earlier steps?
5) These items happen to correspond to column names in another data structure (named counts_data here). I use them to extract the corresponding columns and put them in a list, with the replicate numbers as names:
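As one way dplyr might cover steps 1 to 3, the criteria list can be turned into a one-row data frame and used in a filtering join with semi_join. This is a sketch under assumptions: the data frame is reconstructed with character columns, and the name criteria is introduced here purely for illustration.

```r
library(dplyr)

# Assumed reconstruction of the example data frame from the question.
experimental_plan <- data.frame(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = rep(1:2, each = 4),
  stringsAsFactors = FALSE
)
ref_condition <- list(genotype = "WT", treatment = "normal")

# Hypothetical helper object: the criteria as a one-row data frame.
criteria <- as.data.frame(ref_condition, stringsAsFactors = FALSE)

# Keep only the rows of experimental_plan that have a match in criteria.
selected_lines <- semi_join(experimental_plan, criteria, by = names(ref_condition))

libs <- pull(selected_lines, lib)
print(libs)
```

A filtering join keeps the columns and row order of experimental_plan, so the result can be passed to the later filter and select steps unchanged.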
> counts_1 <- counts_data[, lib1]
> counts_2 <- counts_data[, lib2]
> list_of_counts <- list("1" = counts_1, "2" = counts_2)
(Ideally, I would like to generalize the code so that I do not need to hard-code the values that exist in the "replicate" column: there could be any number of replicates for a given combination of "genotype" and "treatment", and I want my final list to contain the data from counts_data pertaining to the corresponding "lib" items.)
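A possible way to avoid hard-coding the replicate values is to split the selected "lib" items by replicate and then extract the matching columns for each group. The sketch below assumes counts_data is a matrix whose column names are the "lib" values; its contents and the gene1 to gene3 row names are invented for illustration.

```r
# Assumed reconstruction of the example data frame from the question.
experimental_plan <- data.frame(
  lib = LETTERS[1:8],
  genotype = rep(c("WT", "WT", "mut", "mut"), 2),
  treatment = rep(c("normal", "hot"), 4),
  replicate = rep(1:2, each = 4),
  stringsAsFactors = FALSE
)
selected_lines <- experimental_plan[
  experimental_plan$genotype == "WT" & experimental_plan$treatment == "normal",
]

# Hypothetical counts matrix: one column per lib, made-up values.
counts_data <- matrix(
  seq_len(24), nrow = 3,
  dimnames = list(paste0("gene", 1:3), LETTERS[1:8])
)

# Group the lib names by replicate; the list names come from the replicate values.
# as.character() guards against factor columns, which would index by integer code.
libs_by_replicate <- split(as.character(selected_lines$lib), selected_lines$replicate)

# For each replicate, pull the matching columns (drop = FALSE keeps a matrix).
list_of_counts <- lapply(
  libs_by_replicate,
  function(libs) counts_data[, libs, drop = FALSE]
)
print(names(list_of_counts))
```

This works for any number of replicates, and a replicate with several matching "lib" items simply yields a multi-column matrix.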
Is there a way to do the whole process more elegantly / efficiently?
I think you can use data.table with a key for this:
library(data.table)
test <- data.table(lib = LETTERS[1:8],
genotype = rep(c("WT","WT","mut","mut"),2),
treatment = rep(c("normal","hot"),4),
replicate = c(rep(1,4),rep(2,4)))
setkeyv(test,c("genotype","treatment"))
ref_condition = list(genotype="WT", treatment="normal")
test[ref_condition,lib]
This gives
[1] "A" "E"
You could of course use lapply to loop over a list of test conditions.
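The lapply idea might look something like the following sketch. The conditions list and its names (WT_normal, mut_hot) are made up for illustration; each element has the same shape as ref_condition.

```r
library(data.table)

test <- data.table(lib = LETTERS[1:8],
                   genotype = rep(c("WT", "WT", "mut", "mut"), 2),
                   treatment = rep(c("normal", "hot"), 4),
                   replicate = c(rep(1, 4), rep(2, 4)))
setkeyv(test, c("genotype", "treatment"))

# Hypothetical set of conditions, each a (column_name = value) list
# given in the same order as the key columns.
conditions <- list(
  WT_normal = list(genotype = "WT", treatment = "normal"),
  mut_hot = list(genotype = "mut", treatment = "hot")
)

# Keyed join for each condition, returning the matching lib values.
libs_per_condition <- lapply(conditions, function(cond) test[cond, lib])
print(libs_per_condition)
```

The result is a named list with one character vector of "lib" values per condition.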