Concatenate strings by group with dplyr for multiple columns

Views: 107 | Published: 2020/10/15 19:42:53 | Tags: r, string, data.table, dplyr, concatenation

Question

Hi, I need to concatenate strings by group across multiple columns. I realise that versions of this question have been asked several times (see "Aggregating by unique identifier and concatenating related values into a string"), but they usually involve concatenating the values of a single column.

My dataset looks something like this:

Sample  group   Gene1   Gene2   Gene3
A       1       a       NA      NA
A       2       b       NA      NA
B       1       NA      c       NA
C       1       a       NA      d
C       2       b       NA      e
C       3       c       NA      NA

I want to get it into a format where each sample takes up only one row (the group column is optional):

Sample  group   Gene1   Gene2   Gene3
A       1,2     a,b     NA      NA
B       1       NA      c       NA
C       1,2,3   a,b,c   NA      d,e

Since the number of genes can run into the thousands, I can't simply list the columns that I wish to concatenate. I know aggregate or dplyr can be used to form the groups, but I can't figure out how to do it for multiple columns.

Thanks in advance!

As my dataset is very large, containing thousands of genes, I realised dplyr is too slow. I've been experimenting with data.table, and the following code also gets what I want:

library(data.table)
setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]

The output is now:

   Sample group Gene1 Gene2 Gene3
1:      A   1,2   a,b            
2:      B     1           c      
3:      C 1,2,3 a,b,c         d,e

Thanks for all your help!
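One detail worth noting: the data.table output above contains empty strings where the desired output shows NA. The following is a minimal sketch of one way to keep NA for groups whose values are all missing. It assumes the sample data from the question, and `collapse_non_na` is a hypothetical helper name introduced here, not part of data.table or the original post.

```r
library(data.table)

# Sample data reconstructed from the table in the question.
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not from the original post):
# join the non-missing values, but return NA rather than "" when
# every value in the group is missing.
collapse_non_na <- function(x, sep = ",") {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = sep)
}

# Apply the helper to every non-grouping column, one row per Sample.
res <- setDT(df)[, lapply(.SD, collapse_non_na), by = Sample]
print(res)
```

Because the helper always returns a character value, every column keeps a consistent type across groups, which data.table requires when combining per-group results.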

Recommended answer

For these purposes, there are the summarise_all, summarise_at, and summarise_if functions. Using summarise_all:

library(dplyr)

df %>%
  group_by(Sample) %>%
  # funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
  summarise_all(~ paste(na.omit(.), collapse = ","))

# A tibble: 3 × 5
  Sample group Gene1 Gene2 Gene3
   <chr> <chr> <chr> <chr> <chr>
1      A   1,2   a,b            
2      B     1           c      
3      C 1,2,3 a,b,c         d,e
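In dplyr 1.0.0 and later, the scoped verbs (summarise_all, summarise_at, summarise_if) are superseded by across(). The sketch below shows what the same summary might look like in that style, assuming the sample data from the question and a dplyr version that provides across(). The second call restricts the summary to columns whose names start with "Gene", which plays the role summarise_at had; that naming pattern is an assumption based on the sample table.

```r
library(dplyr)

# Sample data reconstructed from the table in the question.
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# All non-grouping columns: everything() skips the grouping column Sample.
res_all <- df %>%
  group_by(Sample) %>%
  summarise(across(everything(), ~ paste(na.omit(.x), collapse = ",")))

# Only the gene columns (assumes they share the "Gene" prefix).
res_genes <- df %>%
  group_by(Sample) %>%
  summarise(across(starts_with("Gene"), ~ paste(na.omit(.x), collapse = ",")))

print(res_all)
print(res_genes)
```

As with the summarise_all version, groups with no non-missing values come out as empty strings rather than NA.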
