R; DPLYR: Convert a list of dataframes into a single organized dataframe
Question
I have a list with multiple entries; an example entry looks like this:
> head(gene_sets[[1]])
patient Diagnosis Eigen_gene ENSG00000080824 ENSG00000166165 ENSG00000211459 ENSG00000198763 ENSG00000198938 ENSG00000198886
1 689_120604 AD -0.5606425 50137 38263 309298 528233 523420 730537
2 412_120503 AD 0.9454632 44536 23333 404316 730342 765963 1168123
3 706_120605 AD 0.6061834 16647 22021 409498 614314 762878 1171747
4 486_120515 AD 0.8164779 21871 9836 518046 697051 613621 1217262
5 469_120514 AD 0.5354927 33460 11651 468223 653745 608259 1115973
6 369_120502 AD -0.8363372 32168 44760 271978 436132 513194 784537
For these entries, the first three columns are always the same, and the total number of columns varies.
What I would like to do is convert this entire list into a single dataframe. The information I need to retain is set_index, the index of the entry in the list, followed by all the column names after Eigen_gene through the last column.
I can think of solutions using loops; however, I would like a dplyr/reshape solution.
To clarify, if we had a fake input that looked like:
> list(data.frame(patient= c(1,2,3), Diagnosis= c("AD","Control", "AD"), Eigen_gene= c(1.1, 2.3, 4.3), geneA= c(1,1,1), geneC= c(2,1,3), geneB= c(2,39,458)))
[[1]]
patient Diagnosis Eigen_gene geneA geneC geneB
1 1 AD 1.1 1 2 2
2 2 Control 2.3 1 1 39
3 3 AD 4.3 1 3 458
The desired output would look like this (I have only shown the first list entry as example input; the output shows how the other entries in the list would also be formatted):
> data.frame(set_index= c(1,1,1,2,2,2,3,3), gene= c("geneA", "geneC", "geneB", "geneF", "geneE", "geneH", "geneT", "geneZ"))
set_index gene
1 1 geneA
2 1 geneC
3 1 geneB
4 2 geneF
5 2 geneE
6 2 geneH
7 3 geneT
8 3 geneZ
Thanks!
Recommended answer
Here is a solution from the tidyverse, using purrr. I extended the example input to produce the example output. The key function here is imap, which for an unnamed list is shorthand for map2(x, seq_along(x), ...) (for a named list it passes names(x) instead). See the help for more. What we want to do is apply a function to each dataframe in the list together with its index, so we use the function ~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])).
- ~, .x and .y are purrr shorthand for function(x, y), x and y. This lets us refer to the function's arguments compactly. See ?map2.
- set_index = .y creates the first column and fills it with the index of the current dataframe (it is conveniently recycled to the right length).
- gene = colnames(.x[4:ncol(.x)]) creates the second column from a vector of the gene names. colnames gets the variable names of the data frame, but we subset to exclude the first three.
- If we had just imap, we would get a list of data frames. imap_dfr takes that list and binds its elements together as rows, producing our desired output (equivalent to calling bind_rows afterwards).
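To make the imap/map2 relationship described above concrete, here is a minimal sketch. The list x and the helper f are made up for illustration and are not part of the original answer. It checks that, for an unnamed list, imap gives the same result as map2 with seq_along, and that for a named list the second argument is the element name.

```r
library(purrr)

# Hypothetical unnamed list, standing in for the list of data frames
x <- list(c("geneA", "geneC"), c("geneF"))

# Hypothetical helper: the second argument is whatever imap/map2 passes
# alongside each element (here, the position in the list)
f <- function(value, index) paste(index, value)

via_imap <- imap(x, f)
via_map2 <- map2(x, seq_along(x), f)

# Compare the two results
same_result <- identical(via_imap, via_map2)

# With a named list, imap passes the name instead of the position
named_result <- imap(list(a = 1), function(value, name) name)
```

This is why .y in the answer's formula is an integer index: gene_list is built without names.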
library(tidyverse)
gene_list <- list(
data.frame(
patient= c(1,2,3),
Diagnosis= c("AD","Control", "AD"),
Eigen_gene= c(1.1, 2.3, 4.3),
geneA= c(1,1,1),
geneC= c(2,1,3),
geneB= c(2,39,458)
),
data.frame(
patient= c(1,2,3),
Diagnosis= c("AD","Control", "AD"),
Eigen_gene= c(1.1, 2.3, 4.3),
geneF= c(1,1,1),
geneE= c(2,1,3),
geneH= c(2,39,458)
),
data.frame(
patient= c(1,2,3),
Diagnosis= c("AD","Control", "AD"),
Eigen_gene= c(1.1, 2.3, 4.3),
geneT= c(1,1,1),
geneZ= c(2,1,3)
)
)
output <- gene_list %>%
imap_dfr(~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])))
output
#> # A tibble: 8 x 2
#> set_index gene
#> <int> <chr>
#> 1 1 geneA
#> 2 1 geneC
#> 3 1 geneB
#> 4 2 geneF
#> 5 2 geneE
#> 6 2 geneH
#> 7 3 geneT
#> 8 3 geneZ
Created by the reprex package (v0.2.0).
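As a variation on the answer above, and not the original author's code, the sketch below drops the three identifier columns by name with setdiff instead of by position. One reason to consider this: 4:ncol(.x) evaluates to c(4, 3) when a data frame has only the three identifier columns, which makes the positional subset fail. The names gene_list_small and id_cols are assumptions introduced here for illustration. It uses imap followed by bind_rows, which the answer notes is equivalent to imap_dfr; in purrr 1.0.0 and later the _dfr variants are marked as superseded in favour of imap followed by list_rbind, which should work similarly here.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical two-entry input; the second entry has no gene columns at all
gene_list_small <- list(
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(1.1, 2.3), geneA = c(1, 1), geneC = c(2, 1)),
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(1.1, 2.3))
)

# Columns shared by every entry (assumed from the question's description)
id_cols <- c("patient", "Diagnosis", "Eigen_gene")

output_small <- gene_list_small %>%
  # Keep every column name that is not an identifier column
  imap(~ tibble(set_index = .y, gene = setdiff(colnames(.x), id_cols))) %>%
  # Stack the per-entry tibbles into one data frame
  bind_rows()

output_small
```

Selecting by name also keeps working if the identifier columns are not the first three, at the cost of hard-coding their names.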