R; DPLYR: Convert a list of dataframes into a single organized dataframe

Problem description

I have a list with multiple entries; an example entry looks like:

> head(gene_sets[[1]])
     patient Diagnosis Eigen_gene ENSG00000080824 ENSG00000166165 ENSG00000211459 ENSG00000198763 ENSG00000198938 ENSG00000198886
1 689_120604        AD -0.5606425           50137           38263          309298          528233          523420          730537
2 412_120503        AD  0.9454632           44536           23333          404316          730342          765963         1168123
3 706_120605        AD  0.6061834           16647           22021          409498          614314          762878         1171747
4 486_120515        AD  0.8164779           21871            9836          518046          697051          613621         1217262
5 469_120514        AD  0.5354927           33460           11651          468223          653745          608259         1115973
6 369_120502        AD -0.8363372           32168           44760          271978          436132          513194          784537

For these entries, the first three columns are always the same, and the total number of columns varies.

What I would like to do is convert this entire list into a single dataframe. The information I need to retain is set_index, the index of the entry in the list, and then all the column names after Eigen_gene through the last column.

I can think of solutions using loops; however, I would like a dplyr/reshape solution.

To clarify, if we had a fake input that looked like:

> list(data.frame(patient= c(1,2,3), Diagnosis= c("AD","Control", "AD"), Eigen_gene= c(1.1, 2.3, 4.3), geneA= c(1,1,1), geneC= c(2,1,3), geneB= c(2,39,458)))
[[1]]
  patient Diagnosis Eigen_gene geneA geneC geneB
1       1        AD        1.1     1     2     2
2       2   Control        2.3     1     1    39
3       3        AD        4.3     1     3   458

The desired output would look like this (I have only shown the first list entry as example input; the output shows how the other entries in the list would be formatted as well):

> data.frame(set_index= c(1,1,1,2,2,2,3,3), gene= c("geneA", "geneC", "geneB", "geneF", "geneE", "geneH", "geneT", "geneZ"))
  set_index  gene
1         1 geneA
2         1 geneC
3         1 geneB
4         2 geneF
5         2 geneE
6         2 geneH
7         3 geneT
8         3 geneZ

Thanks!

Recommended answer

Here is a solution using purrr from the tidyverse. I extended the example input to produce the example output. The key function here is imap, which for an unnamed list like this one is shorthand for map2(x, seq_along(x), ...). See the help for more. What we want to do is apply a function to each dataframe in the list together with its index, so we use the function ~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])).


  • ~, .x and .y are purrr shorthand for function(x, y), x and y. This lets us refer to the function's arguments compactly. See ?map2.
  • set_index = .y creates the first column and fills it with the index of the current dataframe (tibble recycles the single value to the right length).
  • gene = colnames(.x[4:ncol(.x)]) creates the second column from a vector of the gene names. colnames gets the variable names of the data frame, but we subset to exclude the first three.
  • With plain imap, we would get a list of data frames. imap_dfr takes that list and binds the data frames together as rows, producing the desired output (equivalent to calling bind_rows afterwards).

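
To make the shorthand in the bullets above concrete, here is a minimal sketch that writes the same step twice: once with imap and the formula notation, and once with map2 and an ordinary function. The list toy and its contents are hypothetical and exist only for illustration. Recent purrr releases mark the *_dfr helpers as superseded, so the final row-binding is written here as an explicit bind_rows call, which the answer describes as equivalent.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical toy list: three identifier columns followed by gene columns
toy <- list(
  data.frame(patient = 1, Diagnosis = "AD", Eigen_gene = 0.1, geneA = 1, geneB = 2),
  data.frame(patient = 2, Diagnosis = "AD", Eigen_gene = 0.2, geneC = 3)
)

# Formula shorthand: .x is the data frame, .y is its position in the list
via_imap <- imap(toy, ~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])))

# The same step spelled out with map2() and a regular two-argument function
via_map2 <- map2(toy, seq_along(toy), function(x, y) {
  tibble(set_index = y, gene = colnames(x[4:ncol(x)]))
})

# Both approaches should build the same list of tibbles
all.equal(via_imap, via_map2)

# Row-binding that list yields one data frame, as imap_dfr() does in one call
combined <- bind_rows(via_imap)
combined
```

The complete solution on the extended example input is:
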
library(tidyverse)
gene_list <- list(
  data.frame(
    patient= c(1,2,3),
    Diagnosis= c("AD","Control", "AD"),
    Eigen_gene= c(1.1, 2.3, 4.3),
    geneA= c(1,1,1),
    geneC= c(2,1,3),
    geneB= c(2,39,458)
  ),
  data.frame(
    patient= c(1,2,3),
    Diagnosis= c("AD","Control", "AD"),
    Eigen_gene= c(1.1, 2.3, 4.3),
    geneF= c(1,1,1),
    geneE= c(2,1,3),
    geneH= c(2,39,458)
  ),
  data.frame(
    patient= c(1,2,3),
    Diagnosis= c("AD","Control", "AD"),
    Eigen_gene= c(1.1, 2.3, 4.3),
    geneT= c(1,1,1),
    geneZ= c(2,1,3)
  )
)

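# For each data frame (.x) and its position in the list (.y), build a tibble
# holding the list index and the gene column names, then row-bind the results
output <- gene_list %>%
  imap_dfr(~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])))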
output
#> # A tibble: 8 x 2
#>   set_index gene 
#>       <int> <chr>
#> 1         1 geneA
#> 2         1 geneC
#> 3         1 geneB
#> 4         2 geneF
#> 5         2 geneE
#> 6         2 geneH
#> 7         3 geneT
#> 8         3 geneZ

Created with the reprex package (v0.2.0).
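
One caveat with the positional subset 4:ncol(.x): if a data frame has only the three identifier columns, 4:3 evaluates to c(4, 3) and the column selection fails. The sketch below is one possible variant, not part of the original answer, that drops the identifier columns by name instead. The names id_cols, gene_sets_demo and genes_by_set are hypothetical, and it assumes every data frame uses the identifier column names shown in the question.

```r
library(purrr)
library(tibble)
library(dplyr)

# Assumption: every data frame carries these three identifier columns
id_cols <- c("patient", "Diagnosis", "Eigen_gene")

# Hypothetical input; the second entry has no gene columns at all
gene_sets_demo <- list(
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(0.5, -0.2), geneA = c(1, 2), geneB = c(3, 4)),
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(0.1, 0.3)),
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(0.7, 0.9), geneT = c(5, 6))
)

# setdiff() keeps every column name that is not an identifier, so an entry
# without gene columns contributes a zero-row tibble instead of an error
genes_by_set <- gene_sets_demo %>%
  imap(~ tibble(set_index = .y, gene = setdiff(colnames(.x), id_cols))) %>%
  bind_rows()

genes_by_set
```

Selecting by name also keeps working if the identifier columns ever appear in a different order, at the cost of hard-coding their names.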
