Merge almost identical rows, filtering NAs and shorter strings
Question
I have some almost identical rows in a data frame (see the example below). Rows are considered related when they share the same values in certain key variables, sel1 and sel2 in this example. The other variables, var1 and var2, must be merged according to the following criteria: 1. discard NA, or 2. discard the shorter string (as in var2 in the example). So far I have managed to discard the NAs, but I have not found a way to discard the shorter string at the same time. The strings are complex and may contain commas, spaces and several kinds of characters.
df <- read.table(text =
" sel1 sel2 var1 var2
1 pseudorepeated1 x NA \"longer string\" # keep longer string instead of shortstring
2 pseudorepeated1 x 2 \"short string\" # keep 2 instead of NA
3 pseudorepeated2 y NA \"longer string 2\" # keep longer string 2
4 pseudorepeated2 y 4 \"short string2\" # keep 4
5 3 x gs as
6 4 y fg df
7 5 x eg af
8 6 y df fd", header = TRUE, stringsAsFactors=F)
df
df[is.na(df)] <- ""
df2<-aggregate(. ~ sel1 + sel2,data=df,FUN=function(X)paste(unique((X))) )
paste_noNA <- function(x,sep=", ")
gsub(", " ,sep, toString(x[!is.na(x) & x!="" & x!="NA"] ) )
df3<-as.data.frame(lapply(df2, function(X) unlist(lapply(X, function(x) paste_noNA(x)) ) ),
stringsAsFactors=F )
The expected output would not contain the ", short string" text that appears in this table.
df3
sel1 sel2 var1 var2
1.1 3 x gs as
1.3 5 x eg af
1.5 pseudorepeated1 x 2 longer string, short string# only longer string desired
2.2 4 y fg df
2.4 6 y df fd
2.6 pseudorepeated2 y 4 longer string 2, short string2# only longer string 2 desired
Recommended answer
Group by sel1 and sel2, remove the NA values in var1, and replace the shorter string in var2 with the longer one. Finally, remove the duplicate rows.
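library('data.table')
setDT(df)  # convert df to a data.table by reference
# within each (sel1, sel2) group: keep the longest var2 and the non-NA var1
df[, `:=` ( var2 = { temp <- nchar(var2); var2[ temp == max(temp) ] },
            var1 = na.omit(var1)),
   by = .(sel1, sel2)]
df[ !duplicated( df ), ]  # drop the rows that are now identical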
# sel1 sel2 var1 var2
# 1: pseudorepeated1 x 2 longerstring
# 2: pseudorepeated2 y 4 longerstring
# 3: 3 x gs as
# 4: 4 y fg df
# 5: 5 x eg af
# 6: 6 y df fd
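The selection step above keeps every string that ties for the maximum length, and it yields an empty vector when a group contains only NA, so the result may not fit back into the group. Below is a minimal sketch of a more defensive helper. The name pick_longest is hypothetical and not part of the original answer; it always returns exactly one value per group.

```r
# Hypothetical helper (not from the original answer): return the single
# longest non-NA string of a vector, or NA when nothing is left.
pick_longest <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x)]                        # criterion 1: discard NA
  if (length(x) == 0L) return(NA_character_)
  x[which.max(nchar(x))]                   # criterion 2: keep the longest; first one wins on ties
}

pick_longest(c(NA, "2"))
pick_longest(c("longer string", "short string"))
pick_longest(c("ab", "cd"))
pick_longest(c(NA, NA))
```

Because which.max() returns the first position of the maximum, ties are resolved in favour of the earlier row. Whether that is the right tie-breaking rule depends on the data.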
EDIT: with many columns
Data:
df <- read.table(text =
" sel1 sel2 var1 var2
1 pseudorepeated1 x NA longerstring # keep longerstring instead of shortstring
2 pseudorepeated1 x 2 shortstring # keep 2 instead of NA
3 pseudorepeated2 y NA longerstring # same as above
4 pseudorepeated2 y 4 shortstring # same as above
5 3 x gs as
6 4 y fg df
7 5 x eg af
8 6 y df fd", header = TRUE, stringsAsFactors=F)
library('data.table')
setDT(df)
df$var3 <- df$var2
df$var4 <- df$var2
Code:
for( nm in c( "var1", "var2", "var3", "var4") ){
df[, paste0(nm) := { temp <- na.omit(get(nm)); temp[ nchar(temp) == max(nchar(temp)) ] },
by = .(sel1, sel2)]
}
df[ !duplicated( df ), ]
Output:
# sel1 sel2 var1 var2 var3 var4
# 1: pseudorepeated1 x 2 longerstring longerstring longerstring
# 2: pseudorepeated2 y 4 longerstring longerstring longerstring
# 3: 3 x gs as as as
# 4: 4 y fg df df df
# 5: 5 x eg af af af
# 6: 6 y df fd fd fd
EDIT 2: avoiding the for loop by using .SDcols and a variable holding the column names
col_nm <- c( "var1", "var2", "var3", "var4")
df[, paste0(col_nm) := lapply( .SD, function(x) {
temp <- na.omit(x)
temp[ nchar(temp) == max(nchar(temp)) ] } ),
by = .(sel1, sel2),
.SDcols = col_nm ]
df[ !duplicated( df ), ]
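For readers who would rather stay in base R, as the question's aggregate() attempt did, the same "one row per (sel1, sel2) group" result can be sketched without data.table. This is an alternative under stated assumptions, not the answer's method: the data frame is rebuilt by hand to mirror the question's example, and longest_non_na is a hypothetical helper name.

```r
# Sample data mirroring the question's example (built by hand so the block is self-contained)
df <- data.frame(
  sel1 = c("pseudorepeated1", "pseudorepeated1", "pseudorepeated2", "pseudorepeated2",
           "3", "4", "5", "6"),
  sel2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
  var1 = c(NA, "2", NA, "4", "gs", "fg", "eg", "df"),
  var2 = c("longer string", "short string", "longer string 2", "short string2",
           "as", "df", "af", "fd"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: longest non-NA, non-empty string of a group (first one on ties)
longest_non_na <- function(x) {
  x <- x[!is.na(x) & x != ""]
  if (length(x) == 0L) return(NA_character_)
  x[which.max(nchar(x))]
}

# The data-frame method of aggregate() hands NA values in the value columns to FUN,
# so there is no need to replace NA with "" beforehand.
res <- aggregate(df[c("var1", "var2")],
                 by = df[c("sel1", "sel2")],
                 FUN = longest_non_na)
print(res)
```

Since each group is collapsed directly to one row, no separate duplicate-removal step is needed. The row order follows aggregate()'s grouping order rather than the original row order.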