Concatenate strings by group with dplyr for multiple columns
Question
I need to concatenate strings by group for multiple columns. I realise that versions of this question have been asked several times (see "Aggregating by unique identifier and concatenating related values into a string"), but they usually involve concatenating the values of a single column.
My dataset looks something like this:
Sample group Gene1 Gene2 Gene3
A 1 a NA NA
A 2 b NA NA
B 1 NA c NA
C 1 a NA d
C 2 b NA e
C 3 c NA NA
I want to get it into a format where each sample takes only one row (the group column is optional):
Sample group Gene1 Gene2 Gene3
A 1,2 a,b NA NA
B 1 NA c NA
C 1,2,3 a,b,c NA d,e
Since the number of genes can run into the thousands, I can't simply specify the columns that I wish to concatenate. I know aggregate or dplyr can be used to get the groups, but I can't figure out how to do it for multiple columns.
Thanks in advance!
As my dataset is very large, containing thousands of genes, I realised dplyr is too slow. I've been experimenting with data.table, and the following code also gets what I want:
setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]
The output is now:
   Sample group Gene1 Gene2 Gene3
1:      A   1,2   a,b
2:      B     1           c
3:      C 1,2,3 a,b,c         d,e
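For reference, the data.table call above can be run end to end roughly as follows. This is only a sketch under stated assumptions: the data frame is rebuilt by hand from the sample table (with every column stored as character), and the helper name collapse_non_na is hypothetical. Unlike the one-liner, it returns NA rather than an empty string when a group has no non-missing values, which matches the desired output shown earlier.

```r
library(data.table)

# Sample data rebuilt by hand from the table in the question
# (assumption: all columns are stored as character).
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c("1", "2", "1", "1", "2", "3"),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the non-missing values with commas,
# or return NA when every value in the group is missing.
collapse_non_na <- function(x) {
  x <- na.omit(x)
  if (length(x) == 0) NA_character_ else paste(x, collapse = ",")
}

# as.data.table() copies df, so the original data frame is left unchanged.
dt <- as.data.table(df)

# .SD holds every non-grouping column, so no gene names need to be listed.
result <- dt[, lapply(.SD, collapse_non_na), by = Sample]
print(result)
```

Because .SD covers all non-grouping columns by default, the same call should work unchanged however many gene columns the table has.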
Thanks for all the help!
Recommended answer
For these purposes, there are the summarise_all, summarise_at, and summarise_if functions. Using summarise_all:
library(dplyr)
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job.
df %>%
  group_by(Sample) %>%
  summarise_all(~ paste(na.omit(.), collapse = ","))
# A tibble: 3 × 5
  Sample group Gene1 Gene2 Gene3
  <chr>  <chr> <chr> <chr> <chr>
1 A      1,2   a,b
2 B      1           c
3 C      1,2,3 a,b,c       d,e
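In dplyr 1.0.0 and later, the scoped verbs (summarise_all and friends) are superseded by across(). A minimal sketch of the same operation in that style, assuming the data frame is rebuilt by hand from the question's sample table with all columns as character:

```r
library(dplyr)

# Sample data rebuilt by hand from the table in the question
# (assumption: all columns are stored as character).
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c("1", "2", "1", "1", "2", "3"),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# everything() selects all non-grouping columns, so the gene columns
# never have to be named individually.
result <- df %>%
  group_by(Sample) %>%
  summarise(across(everything(), ~ paste(na.omit(.x), collapse = ",")))

print(result)
```

As in the output above, a group with no non-missing values comes out as an empty string rather than NA; dplyr's na_if() can convert those back to NA if that is preferred. To restrict the operation to the gene columns only, everything() could be swapped for a selection helper such as starts_with("Gene").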