合并列,同时忽略重复项和NA [英] Combining columns, while ignoring duplicates and NAs
问题描述
我有一个数据框,如下所示,我想合并两列,即 Var1
和 Var2
。我希望合并的列( Var3
)不包含< alpha>< digit>
的重复项。例如,如果 Var1 == A1
和 Var2 == A1
,则 Var3 == A1
,但不是 Var3 == A1-A1
,或者如果 Var1 == A4-E9
和 Var2 == A4
,因此 Var3 == A4-E9
但不是 Var3 == A4- E9-A4
I have a dataframe as follows and I would like to combine two columns, namely Var1
and Var2
. I want the combined column (Var3
) to contain no duplicates of <alpha><digit>
. i.e. if Var1 == A1
and Var2 == A1
, hence Var3 == A1
but not Var3 == A1-A1
or if Var1 == A4-E9
and Var2 == A4
, hence Var3 == A4-E9
but not Var3 == A4-E9-A4
df <- read.table(header = TRUE, text =
"id Var1 Var2
A A1 A1
B F2 A2
C NA A3
D A4-E9 A4
E E5 A5
F NA NA
G B2-R4 A3-B2
H B3-B4 E1-G5", stringsAsFactors = FALSE)
以下是我的代码。我想改善其可读性,并摆脱 Var3
NA >,即 A3-NA
。
The following is my code. I would like to improve on its readability as well as get rid of the NA
that is present in row 3's entry for Var3
, i.e A3-NA
.
library(dplyr)
library(tidyr)
df %>%
mutate(Var3 = paste(Var1, Var2, sep = "-")) %>%
separate_rows(Var3, sep = "-") %>%
group_by(id, Var3) %>%
slice(1) %>%
group_by(id) %>%
mutate(Var3 = paste(unlist(Var3[!is.na(Var3)]), collapse = "-")) %>%
slice(1) %>%
ungroup
这是我想要的输出:
# A tibble: 8 x 4
id Var1 Var2 Var3
<chr> <chr> <chr> <chr>
1 A A1 A1 A1
2 B F2 A2 A2-F2
3 C <NA> A3 A3
4 D A4-E9 A4 A4-E9
5 E E5 A5 A5-E5
6 F <NA> <NA> <NA>
7 G B2-R4 A3-B2 A3-B2-R4
8 H B3-B4 E1-G5 B3-B4-E1-G5
推荐答案
如果输出'df1',则删除-之后的'NA'
和 sub
if 'df1' is the output, then we remove the 'NA' that follows a -
with sub
df1 %>%
mutate(Var3 = sub("-NA", "", Var3))
# A tibble: 8 x 4
# id Var1 Var2 Var3
# <chr> <chr> <chr> <chr>
#1 A A1 A1 A1
#2 B F2 A2 A2-F2
#3 C <NA> A3 A3
#4 D A4-E9 A4 A4-E9
#5 E E5 A5 A5-E5
#6 F <NA> <NA> NA
#7 G B2-R4 A3-B2 A3-B2-R4
#8 H B3-B4 E1-G5 B3-B4-E1-G5
我们也可以使用 tidyverse
通过 gather
转换为'long'格式,然后使用 separate_rows
拆分'value'列,并按'id'分组,总结
在 Var3列中,通过粘贴
将排序
ed'Var3'的 unique
元素和 left_join
与原始数据集'df'
We can also do this slightly differently with tidyverse
by gather
into 'long' format, then split the 'value' column using separate_rows
, grouped by 'id', summarise
the 'Var3' column by paste
ing the sort
ed unique
elements of 'Var3' and left_join
with the original dataset 'df'
library(tidyverse)
gather(df, key, value, -id) %>%
separate_rows(value) %>%
group_by(id) %>%
summarise(Var3 = paste(sort(unique(value)), collapse='-')) %>%
mutate(Var3 = replace(Var3, Var3=='', NA)) %>%
left_join(df, .)
# id Var1 Var2 Var3
#1 A A1 A1 A1
#2 B F2 A2 A2-F2
#3 C <NA> A3 A3
#4 D A4-E9 A4 A4-E9
#5 E E5 A5 A5-E5
#6 F <NA> <NA> <NA>
#7 G B2-R4 A3-B2 A3-B2-R4
#8 H B3-B4 E1-G5 B3-B4-E1-G5
注意:%>%
甚至可以使一个简单的代码显示在多行中,但是如果需要时,我们可以将所有这些语句放在一行中并作为单行
NOTE: The %>%
makes even a simple code to appear in multiple lines, but if required, we can put all those statements in a single line and term as one-liner
这里是单线纸
library(data.table)
setDT(df)[, Var3 := paste(sort(unique(unlist(strsplit(unlist(.SD),"-")))), collapse="-"), id]
这篇关于合并列,同时忽略重复项和NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!