Merge almost identical rows, filtering out NAs and shorter strings


Question

I have some almost identical rows in a data frame; see the example below. Rows are treated as related when they match on certain key variables (sel1 and sel2 in this example). The other variables, var1 and var2, must then be merged using the following criteria: 1. discard NA, or 2. discard the shorter string (var2 in the example). So far I have managed to discard the NAs, but I have not found a way to discard the shorter string at the same time. The strings are complex and may contain commas, spaces and several kinds of characters.

df <- read.table(text = 
            "  sel1 sel2 var1    var2
1   pseudorepeated1   x    NA    \"longer string\"   # keep longer string instead of shortstring
2   pseudorepeated1   x    2     \"short string\"    # keep 2 instead of NA
3   pseudorepeated2   y    NA    \"longer string 2\" # keep longer string 2
4   pseudorepeated2   y    4     \"short string2\"   # keep 4
5                 3   x    gs    as
6                 4   y    fg    df
7                 5   x    eg    af
8                 6   y    df    fd", header = TRUE, stringsAsFactors=F)
df
df[is.na(df)] <- ""
df2<-aggregate(. ~ sel1 + sel2,data=df,FUN=function(X)paste(unique((X))) )
paste_noNA <- function(x,sep=", ") 
  gsub(", " ,sep, toString(x[!is.na(x) & x!="" & x!="NA"] ) )
df3<-as.data.frame(lapply(df2, function(X) unlist(lapply(X, function(x) paste_noNA(x)) ) ), 
                           stringsAsFactors=F )

The expected output should not contain the ", short string" text that appears in this table.

df3
               sel1 sel2 var1                        var2
1.1               3    x   gs                          as
1.3               5    x   eg                          af
1.5 pseudorepeated1    x    2 longer string, short string# only longer string desired
2.2               4    y   fg                          df
2.4               6    y   df                          fd
2.6 pseudorepeated2    y    4 longer string 2, short string2# only longer string 2 desired


Recommended answer

Group by sel1 and sel2, remove the NA values in var1, and replace the shorter string in var2 with the longer one. Finally, remove the duplicate rows.

library('data.table')
setDT(df)
df[, `:=` ( var2 = { temp <- nchar(var2); var2[ temp == max(temp) ] },
            var1 = na.omit(var1)),
   by = .(sel1, sel2)]
df[ !duplicated( df ), ]

#               sel1 sel2 var1         var2
# 1: pseudorepeated1    x    2 longerstring
# 2: pseudorepeated2    y    4 longerstring
# 3:               3    x   gs           as
# 4:               4    y   fg           df
# 5:               5    x   eg           af
# 6:               6    y   df           fd
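
The call above works on the example because each group has exactly one non-NA var1 and one longest var2. With real data, strings can tie in length or a group can be entirely NA, and then the per-group result may not be a single value. As a minimal sketch of the selection step on its own, the helper below (pick_longest is a hypothetical name, not part of the answer) always returns exactly one value per vector:

```r
# Hypothetical helper (not from the original answer): return the single
# longest non-NA value of a vector, or NA if every value is missing.
pick_longest <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x)]
  if (length(x) == 0L) return(NA_character_)
  # which.max() returns the index of the first maximum,
  # so strings that tie in length still yield one value
  x[which.max(nchar(x))]
}

pick_longest(c("short string", "longer string"))  # the 13-character string wins
pick_longest(c(NA, "2"))                          # NA is dropped first
pick_longest(c("ab", "cd"))                       # tie: first one is kept
pick_longest(c(NA, NA))                           # all missing: NA
```

Keeping the first of several tied strings is an assumption made here for illustration; whether that is the right tie-break depends on the data.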

EDIT: with many columns

Data:

df <- read.table(text = 
                   "  sel1 sel2 var1    var2
                 1   pseudorepeated1   x    NA    longerstring   # keep longerstring instead of shortstring
                 2   pseudorepeated1   x    2     shortstring    # keep 2 instead of NA
                 3   pseudorepeated2   y    NA    longerstring   # same as above
                 4   pseudorepeated2   y    4     shortstring    # same as above
                 5                 3   x    gs    as
                 6                 4   y    fg    df
                 7                 5   x    eg    af
                 8                 6   y    df    fd", header = TRUE, stringsAsFactors=F)

library('data.table')
setDT(df)
# add two extra columns by reference (copies of var2) to simulate a wider table
df[, c("var3", "var4") := .(var2, var2)]

Code:

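for (nm in c("var1", "var2", "var3", "var4")) {
  # (nm) makes := assign to the column whose name is stored in nm;
  # within each group, drop NAs and keep the longest remaining value
  df[, (nm) := { temp <- na.omit(get(nm)); temp[nchar(temp) == max(nchar(temp))] },
     by = .(sel1, sel2)]
}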
df[ !duplicated( df ), ]

Output:

#               sel1 sel2 var1         var2         var3         var4
# 1: pseudorepeated1    x    2 longerstring longerstring longerstring
# 2: pseudorepeated2    y    4 longerstring longerstring longerstring
# 3:               3    x   gs           as           as           as
# 4:               4    y   fg           df           df           df
# 5:               5    x   eg           af           af           af
# 6:               6    y   df           fd           fd           fd

EDIT 2: avoiding the for loop by using .SDcols and a variable holding the column names

col_nm <- c( "var1", "var2", "var3", "var4")

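# .SD holds only the columns listed in .SDcols; each one is processed per group
df[, (col_nm) := lapply(.SD, function(x) {
  temp <- na.omit(x)                       # drop NAs
  temp[nchar(temp) == max(nchar(temp))]    # keep the longest remaining value
}),
  by = .(sel1, sel2),
  .SDcols = col_nm]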

df[ !duplicated( df ), ]
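
As a possible variation on the approach above, the rows can be collapsed to one per group directly, instead of assigning by reference and then dropping duplicates. This is only a sketch under stated assumptions: the small table dt and the helper longest are made up for illustration, and ties in string length are resolved by keeping the first value.

```r
library(data.table)

# Small stand-in for the data above (illustrative values only)
dt <- data.table(
  sel1 = c("pseudorepeated1", "pseudorepeated1", "3"),
  sel2 = c("x", "x", "x"),
  var1 = c(NA, "2", "gs"),
  var2 = c("longer string", "short string", "as")
)

# Hypothetical helper: the single longest non-NA value, or NA if none exists
longest <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) return(NA_character_)
  x[which.max(nchar(x))]
}

col_nm <- c("var1", "var2")

# Without :=, the j expression returns one row per (sel1, sel2) group
res <- dt[, lapply(.SD, longest), by = .(sel1, sel2), .SDcols = col_nm]
res
```

Because each column is reduced to a single value per group, the result has one row per group and no separate duplicated() step is needed.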
