合并列，同时忽略重复项和NA [英] Combining columns, while ignoring duplicates and NAs

查看：86 发布时间：2020/10/16 23:37:44 r dataframe dplyr tidyr

本文介绍了合并列，同时忽略重复项和NA的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个数据框，如下所示，我想合并两列，即 Var1 和 Var2 。我希望合并的列（ Var3 ）不包含< alpha>< digit> 的重复项。例如，如果 Var1 == A1 和 Var2 == A1 ，则 Var3 == A1 ，但不是 Var3 == A1-A1 ，或者如果 Var1 == A4-E9 和 Var2 == A4 ，因此 Var3 == A4-E9 但不是 Var3 == A4- E9-A4

I have a dataframe as follows and I would like to combine two columns, namely Var1 and Var2. I want the combined column (Var3) to contain no duplicates of <alpha><digit>. i.e. if Var1 == A1 and Var2 == A1, hence Var3 == A1 but not Var3 == A1-A1 or if Var1 == A4-E9 and Var2 == A4, hence Var3 == A4-E9 but not Var3 == A4-E9-A4

df <- read.table(header = TRUE, text = 
"id  Var1    Var2
A   A1       A1
B   F2       A2
C   NA       A3
D   A4-E9    A4
E   E5       A5
F   NA       NA
G   B2-R4    A3-B2
H   B3-B4    E1-G5", stringsAsFactors = FALSE)

以下是我的代码。我想改善其可读性，并摆脱 Var3 NA >，即 A3-NA 。

The following is my code. I would like to improve on its readability as well as get rid of the NA that is present in row 3's entry for Var3, i.e A3-NA.

library(dplyr)
library(tidyr)
df %>% 
  mutate(Var3 = paste(Var1, Var2, sep = "-"))  %>%
  separate_rows(Var3, sep = "-") %>%
  group_by(id, Var3) %>%
  slice(1) %>%
  group_by(id) %>%
  mutate(Var3 = paste(unlist(Var3[!is.na(Var3)]), collapse = "-")) %>%
  slice(1) %>%
  ungroup

这是我想要的输出：

# A tibble: 8 x 4
     id  Var1  Var2        Var3
  <chr> <chr> <chr>       <chr>
1     A    A1    A1          A1
2     B    F2    A2       A2-F2
3     C  <NA>    A3          A3
4     D A4-E9    A4       A4-E9
5     E    E5    A5       A5-E5
6     F  <NA>  <NA>        <NA>
7     G B2-R4 A3-B2    A3-B2-R4
8     H B3-B4 E1-G5 B3-B4-E1-G5

推荐答案

如果输出'df1'，则删除-之后的'NA' 和 sub

if 'df1' is the output, then we remove the 'NA' that follows a - with sub

df1 %>% 
    mutate(Var3 = sub("-NA", "", Var3))
# A tibble: 8 x 4
#     id  Var1  Var2        Var3
#  <chr> <chr> <chr>       <chr>
#1     A    A1    A1          A1
#2     B    F2    A2       A2-F2
#3     C  <NA>    A3          A3
#4     D A4-E9    A4       A4-E9
#5     E    E5    A5       A5-E5
#6     F  <NA>  <NA>          NA
#7     G B2-R4 A3-B2    A3-B2-R4
#8     H B3-B4 E1-G5 B3-B4-E1-G5

我们也可以使用 tidyverse 通过 gather 转换为'long'格式，然后使用 separate_rows 拆分'value'列，并按'id'分组，总结在 Var3列中，通过粘贴将排序 ed'Var3'的 unique 元素和 left_join 与原始数据集'df'






We can also do this slightly differently with tidyverse by gather into 'long' format, then split the 'value' column using separate_rows, grouped by 'id', summarise the 'Var3' column by pasteing the sorted unique elements of 'Var3' and left_join with the original dataset 'df'
library(tidyverse)
gather(df, key, value, -id) %>%
       separate_rows(value)  %>%
       group_by(id) %>% 
       summarise(Var3 = paste(sort(unique(value)), collapse='-')) %>% 
       mutate(Var3 = replace(Var3, Var3=='', NA)) %>% 
       left_join(df, .)
#   id  Var1  Var2        Var3
#1  A    A1    A1          A1
#2  B    F2    A2       A2-F2
#3  C  <NA>    A3          A3
#4  D A4-E9    A4       A4-E9
#5  E    E5    A5       A5-E5
#6  F  <NA>  <NA>        <NA>
#7  G B2-R4 A3-B2    A3-B2-R4
#8  H B3-B4 E1-G5 B3-B4-E1-G5

注意：％>％甚至可以使一个简单的代码显示在多行中，但是如果需要时，我们可以将所有这些语句放在一行中并作为单行 
NOTE: The %>% makes even a simple code to appear in multiple lines, but if required, we can put all those statements in a single line and term as one-liner
这里是单线纸
library(data.table)
setDT(df)[, Var3 := paste(sort(unique(unlist(strsplit(unlist(.SD),"-")))), collapse="-"), id]


                        这篇关于合并列，同时忽略重复项和NA的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

合并列，同时忽略重复项和NA [英] Combining columns, while ignoring duplicates and NAs

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

合并列，同时忽略重复项和NA [英] Combining columns, while ignoring duplicates and NAs

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭