Merge duplicate characters in R while preserving data frame structure

Question

I have a toy edgelist for Neural Networking that looks like this:

df<-c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3", 
"Group3", "Group3, Group4", "Group3, Group4")

    V1
1   Group1
2   Group1
3   Group2
4   Group1, Group3
5   Group1, Group3
6   Group3
7   Group3, Group4
8   Group3, Group4

I need to preserve the 8-row structure of the data (with the individual duplicate elements like Group1 in rows 1 & 2), but I want to:

1) Identify instances of duplicate entries that are delimited by a comma (i.e. "Group1, Group3" and "Group3, Group4")

2) For these instances, find a way to merge the values so that one unique value is left in the first duplicate row and the other unique value is left in the second duplicate row, like so:

    V1
1   Group1
2   Group1
3   Group2
4   Group1   # Group3 is dropped
5   Group3   # Group1 is dropped
6   Group3
7   Group3   # Group4 is dropped
8   Group4   # Group3 is dropped

All of the duplicates occur in multiples of two, so there aren't any issues with an odd number of repetitions with only two values, etc.

Edit:

For future reference, what could I do if the edgelist had non-sequential duplicates like so:

df<-c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3", 
      "Group3, Group4", "Group3", "Group3, Group4")

    V1
1   Group1
2   Group1, Group3
3   Group2
4   Group1, Group3
5   Group3
6   Group3, Group4
7   Group3
8   Group3, Group4

The solutions offered wouldn't work in this situation. Also, since the position of the rows is crucial for the network, the data can't be sorted. Any suggestions?

Recommended answer

Remove the second copy of each comma-delimited duplicate, then split the remaining entries at the comma.

unlist(strsplit(df[!(ave(seq_along(df), df, FUN = seq_along) == 2 & grepl(",", df))], ", "))
#[1] "Group1" "Group1" "Group2" "Group1" "Group3" "Group3" "Group3" "Group4"
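
The one-liner above can be unpacked into named steps. The following is only a step-by-step restatement of it; the intermediate names (`occ`, `is_second`, `kept`, `result`) are introduced here for illustration, and it assumes each comma-delimited string appears exactly twice, on adjacent rows.

```r
df <- c("Group1", "Group1", "Group2", "Group1, Group3", "Group1, Group3",
        "Group3", "Group3, Group4", "Group3, Group4")

# Occurrence number of each element among identical strings (1, 2, ...)
occ <- ave(seq_along(df), df, FUN = seq_along)

# Flag the second copy of every comma-delimited string
is_second <- occ == 2 & grepl(",", df)

# Drop the flagged copies; each remaining comma-delimited string still
# holds both values
kept <- df[!is_second]

# Splitting at ", " turns each remaining string into two elements,
# which restores the original length
result <- unlist(strsplit(kept, ", "))
result
```

Because one element is removed and one is split in two for every pair, the result has the same length as `df` only when each comma-delimited string occurs exactly twice.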

df may need to be sorted first if there is a chance duplicates won't be together.
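
If the row positions must be preserved, one possible way to combine sorting with the first approach is to sort a copy and then write the results back to their original positions. This is a sketch, not part of the original answer: the names `o`, `s`, `res` and `out` are made up for illustration, and it still assumes each comma-delimited string occurs exactly twice.

```r
df <- c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
        "Group3, Group4", "Group3", "Group3, Group4")

# order() leaves ties in their original order, so the earlier of two
# identical strings stays first after sorting
o <- order(df)
s <- df[o]

# Drop the second copy of each comma-delimited string, then split
res <- unlist(strsplit(
  s[!(ave(seq_along(s), s, FUN = seq_along) == 2 & grepl(",", s))], ", "))

# Write each value back to the row it came from; df itself is never reordered
out <- character(length(df))
out[o] <- res
out
```

Here `out[i]` lines up with `df[i]`, so the first occurrence of a pair receives the first value and the second occurrence receives the second value.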

Here's another approach using mapply that should work regardless of the order of df:

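df <- c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
        "Group3, Group4", "Group3", "Group3, Group4")
# Split each distinct string into its comma-separated parts
d <- lapply(unique(df), function(x) strsplit(x, ", ?"))
# Position of each element's string within unique(df)
ind <- match(df, unique(df))
# Part to keep: the occurrence number for comma-delimited strings, otherwise 1
grp <- ifelse(grepl(",", df), ave(seq_along(df), df, FUN = seq_along), 1)
# Pick the chosen part for every row
df2 <- mapply(function(i, g) d[[i]][[1]][g], ind, grp)
data.frame(df, df2)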
#>               df    df2
#> 1         Group1 Group1
#> 2 Group1, Group3 Group1
#> 3         Group2 Group2
#> 4 Group1, Group3 Group3
#> 5         Group3 Group3
#> 6 Group3, Group4 Group3
#> 7         Group3 Group3
#> 8 Group3, Group4 Group4
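
The same idea can be wrapped in a small reusable function. The sketch below is a variation rather than the answer's own code: `pick_by_occurrence` is a hypothetical helper name, and cycling through the parts with modular arithmetic is an added assumption so that a string repeated four or more times still gets a value on every row.

```r
# Hypothetical helper: for the k-th occurrence of a string, keep its k-th
# comma-separated part, cycling back to the first part when k runs past the end
pick_by_occurrence <- function(x, sep = ", ?") {
  parts <- strsplit(x, sep)
  occ <- ave(seq_along(x), x, FUN = seq_along)
  vapply(seq_along(x), function(j) {
    p <- parts[[j]]
    p[(occ[j] - 1) %% length(p) + 1]
  }, character(1))
}

df <- c("Group1", "Group1, Group3", "Group2", "Group1, Group3", "Group3",
        "Group3, Group4", "Group3", "Group3, Group4")
df2 <- pick_by_occurrence(df)
data.frame(df, df2)
```

Single-valued strings have only one part, so every occurrence maps back to that part, and no separate `grepl` check is needed.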
