当在另一行中发现元素逗号分隔时合并行 [英] Merge rows when element coma separated found in another row

查看:26
本文介绍了当在另一行中发现元素逗号分隔时合并行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

你好,我有一个数据框,例如:

Hello I have a dataframe such as :

  species family    Events      groups
1     SP1      A     10,22          G1
2     SP1      B         7          G2
3     SP1    C,D 4,5,6,1,3 G3,G4,G5,G6
4     SP2      A     22,10          G1
5     SP2    D,C 6,5,4,3,1 G4,G6,G5,G3
6     SP3      C 4,5,3,6,1    G3,G6,G5
7     SP3      E         7          G2
8     SP3      A        10          G1
9     SP4      C        7,22        G12

并且我想简单地为每列中至少有一个重复元素的行合并( species 除外).

and I would like to simply merge row for each where there is at least one duplicated element in each columns (except species).

例如,我将合并行:

species family    Events      groups
SP1      A        10,22       G1
species family    Events      groups
SP2      A        22,10       G1
species family    Events      groups
SP3      A        10          G1

进入

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1

如果我对每一行都做同样的事情,我应该得到预期的输出:

SO if I do the same for each row I should get an expected output as :

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1
SP1,SP3      B,E       7           G2
SP1,SP2,SP3  C,D       1,3,4,5,6   G3,G4,G6,G5
SP4          C         7,22        G12 

请注意,由于SP4的组不存在于其他任何行中,因此尚未与任何行合并.

Note that SP4 has not been merged with any rows since its group was not present in any other rows.

请问有人有主意吗?非常感谢您的帮助和时间

Does someone have an idea please ? Thank you very much for your help and time

这里是数据框,如果有帮助的话:

Here is the dataframe if it can helps:

structure(list(species = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 4L), .Label = c("SP1", "SP2", "SP3", "SP4"), class = "factor"), 
    family = structure(c(1L, 2L, 4L, 1L, 5L, 3L, 6L, 1L, 3L), .Label = c("A", 
    "B", "C", "C,D", "D,C", "E"), class = "factor"), Events = structure(c(2L, 
    7L, 5L, 3L, 6L, 4L, 7L, 1L, 8L), .Label = c("10", "10,22", 
    "22,10", "4,5,3,6,1", "4,5,6,1,3", "6,5,4,3,1", "7", "7,22"
    ), class = "factor"), groups = structure(c(1L, 3L, 4L, 1L, 
    6L, 5L, 3L, 1L, 2L), .Label = c("G1", "G12", "G2", "G3,G4,G5,G6", 
    "G3,G6,G5", "G4,G6,G5,G3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))

我可以做什么并尝试过:

到目前为止,我只知道如何使用dplyr中的类似内容将具有精确重复值的行合并:

So far I only know how to merge rows with exact duplicated value using something like that in dplyr :

desired_df <- df %>%
  group_by_at(vars(-species)) %>%
  summarize(species = toString(species)) %>%
  ungroup() %>%
  select(names(df))

但是这里我们没有确切的重复值,相反,我正在寻找可以在另一行中出现的逗号之间.

but here we do not have exact duplicated values, instead I'm looking for between comma that can be present in another row.

推荐答案

这是所有方法的解决方案(将输入数据框称为 dat ).

Here is an all-tidyverse solution (calling the input data frame dat).

请注意,该解决方案与您提供的期望输出不相同..这是因为您声明的规则是合并每一列至少有一个重复元素的行",但物种除外."根据该规则,第2行和第7行不应合并,因为它们没有共同的 family .

Please note that this solution is not identical to the desired output you gave. This is because you stated the rule is to "merge rows for which there is at least one duplicated element for each column, except species." By that rule, rows 2 and 7 should not be merged because they share no family in common.

首先,将要测试重叠值的三列转换为列表列.现在,这些列的每个元素都是一个列表.我还将 Events 列强制为数字,以便其可以正确排序.

First, convert the three columns we are going to test for overlapping values into list-columns. Now each element of those columns is a list. I also coerced the Events column to a numeric so that it will sort properly.

library(tidyverse)

dat <- dat %>%
  mutate(across(c(family, Events, groups), ~ strsplit(as.character(.), split = ','))) %>%
  mutate(Events = map(Events, as.numeric))

接下来,定义一个函数来折叠数据帧的每一行.该函数采用参数 i ,它是行索引.在该函数中,我们做两件事:

Next, define a function to collapse each row of the data frame. The function takes argument i which is a row index. Within the function, we do two things:

  • 首先,我们使用 pmap_lgl 遍历数据帧的每一行,以检查所有三列 family Events ,和 groups 与行 i 至少具有一个共享值,因此应折叠.例如,如果 i == 1 ,则将为第1,4和8行返回 TRUE .
  • 接下来,我们仅对返回 TRUE 的行过滤 dat ,并将函数应用于这些行的所有列.该函数将这些行中的所有列折叠为以逗号分隔的具有唯一值的字符串.
  • First we use pmap_lgl to iterate across each row of the data frame to check for which rows all the three columns family, Events, and groups have at least one shared value with row i and therefore should be collapsed. For example, if i==1 this will return TRUE for rows 1,4,and 8.
  • Next, we filter dat for only those rows that returned TRUE, and apply a function to all columns of those rows. The function collapses all columns in those rows into comma-separated strings of the sorted unique values.
collapse_rows <- function(i) {
  rows_collapse <- pmap_lgl(dat, function(family, Events, groups, ...) 
    any(dat$family[[i]] %in% family) & any(dat$Events[[i]] %in% Events) & any(dat$groups[[i]] %in% groups))
  dat %>%
    filter(rows_collapse) %>%
    mutate(across(everything(), ~ paste(sort(unique(unlist(.))), collapse = ',')))
}

最后,我们将此功能应用于每个行索引.我们最终得到重复的行,例如,初始输出的行1、4和8将相同.我们使用 distinct 删除所有这些重复项.

Finally we apply this function to each row index. We end up with duplicated rows, for example rows 1, 4, and 8 of the initial output will be identical. We use distinct to remove all of those duplicates.

dat_collapse <- map_dfr(1:nrow(dat), collapse_rows) %>% distinct

最终输出:

     species family    Events      groups
1 SP1,SP2,SP3      A     10,22          G1
2         SP1      B         7          G2
3 SP1,SP2,SP3    C,D 1,3,4,5,6 G3,G4,G5,G6
4         SP3      E         7          G2
5         SP4      C      7,22         G12

这篇关于当在另一行中发现元素逗号分隔时合并行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆