Concatenate data frame columns without duplicates in R


Problem description

I have a dataframe where I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.

For example, if I had a data frame such as:

  Animal1         Animal2        Label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81

You can see that in row 2, 'cat' appears in both 'Animal1' and 'Animal2'. In row 3, the number 26 appears in both 'Animal1' and 'Label'. In row 4, the information in 'Animal2' and 'Label' is already contained, in the same order, in 'Animal1'.

So by using the paste function I can concatenate the columns...

data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ")

However, I haven't yet managed to remove the duplicates. The output I get is, of course, just the plain concatenation:

  Output1
1 cat dog dolphin 19
2 dog cat cat 72
3 pilchard 26 koala 26
4 newt bat 81 bat 81

Row 1 is fine, but the other rows contain duplicates as described above.

The output I want is:

  Output1
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

I tried removing duplicates after concatenating. I know that within a string you can do something like the example below (e.g. Removing duplicate words in a string in R).

d <- unlist(strsplit(data1, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

This worked when I used a single string, but I couldn't apply it to the whole column: I received an 'unexpected symbol' error referring to the square brackets.
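Independently of the syntax error, the snippet above has two pitfalls when used on a whole column, which the sketch below tries to illustrate. First, unlist(strsplit(data1, ...)) merges the words of all rows into one vector, so row boundaries are lost. Second, when a string has no duplicates, which() returns an empty vector and d[-integer(0)] selects nothing. The data1 vector here is a hypothetical reconstruction of the concatenated column, and dedupe_words is an illustrative helper name, not part of the original post.

```r
# Hypothetical reconstruction of the concatenated column from the question.
data1 <- c("cat dog dolphin 19", "dog cat cat 72",
           "pilchard 26 koala 26", "newt bat 81 bat 81")

# Pitfall: row 1 has no duplicates, so which() returns integer(0)
# and negative indexing with it drops every element.
d <- strsplit(data1[1], split = " ")[[1]]
broken <- paste(d[-which(duplicated(d))], collapse = " ")

# A logical index keeps everything when nothing is duplicated,
# and working on one string at a time preserves the rows.
dedupe_words <- function(s) {
  w <- strsplit(s, split = " ", fixed = TRUE)[[1]]
  paste(w[!duplicated(w)], collapse = " ")
}
result <- vapply(data1, dedupe_words, character(1), USE.NAMES = FALSE)
```

Here broken ends up as an empty string, while result keeps the first occurrence of each word per row.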

I have also seen the unique() function used, e.g. in "Remove Duplicated String in a Row" and "Deleting reversed duplicates with R":

reduce_row = function(i) {
  split = strsplit(i, split=", ")[[1]]
  paste(unique(split), collapse = ", ") 
}
data1$v2 = apply(data1, 1, reduce_row)

I tried to use these examples, but as yet have not been successful.

Any assistance would be very much appreciated.

Solution

After you've run data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " "), split each string on runs of spaces and keep only the unique words:

data.frame(Output1 = vapply(strsplit(data1, " +"), function(x) paste(unique(x), collapse = " "), character(1)))
#              Output1
# 1 cat dog dolphin 19
# 2         dog cat 72
# 3  pilchard 26 koala
# 4        newt bat 81
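Note that unique() keeps the first occurrence of each word, so row 3 comes out as "pilchard 26 koala" rather than the "pilchard koala 26" shown in the desired output. If keeping the last occurrence is acceptable, one possible variant is to use duplicated() with fromLast = TRUE. This is only a sketch: the data frame below is rebuilt from the example in the question (with Label assumed to be numeric), and keep_last is an illustrative helper name.

```r
# Data frame rebuilt from the question's example; Label's type is an assumption.
data <- data.frame(
  Animal1 = c("cat dog", "dog cat", "pilchard 26", "newt bat 81"),
  Animal2 = c("dolphin", "cat", "koala", "bat"),
  Label   = c(19, 72, 26, 81),
  stringsAsFactors = FALSE
)
data1 <- paste(data$Animal1, data$Animal2, data$Label, sep = " ")

# Drop a word whenever the same word appears again later in the string,
# i.e. keep only the last occurrence of each word.
keep_last <- function(x) {
  paste(x[!duplicated(x, fromLast = TRUE)], collapse = " ")
}
Output1 <- vapply(strsplit(data1, " +"), keep_last, character(1))
data.frame(Output1 = Output1)
```

For these four rows this matches the desired output in the question. Whether keeping the last occurrence is the right rule for other data depends on where the "relevant" copy of a word is expected to sit.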
