当在另一行中发现元素逗号分隔时合并行 [英] Merge rows when element coma separated found in another row
问题描述
你好,我有一个数据框,例如:
Hello I have a dataframe such as :
species family Events groups
1 SP1 A 10,22 G1
2 SP1 B 7 G2
3 SP1 C,D 4,5,6,1,3 G3,G4,G5,G6
4 SP2 A 22,10 G1
5 SP2 D,C 6,5,4,3,1 G4,G6,G5,G3
6 SP3 C 4,5,3,6,1 G3,G6,G5
7 SP3 E 7 G2
8 SP3 A 10 G1
9 SP4 C 7,22 G12
并且我想简单地为每列中至少有一个重复元素的行合并( species
除外).
and I would like to simply merge row for each where there is at least one duplicated element in each columns (except species
).
例如,我将合并行:
species family Events groups
SP1 A 10,22 G1
species family Events groups
SP2 A 22,10 G1
species family Events groups
SP3 A 10 G1
进入
species family Events groups
SP1,SP2,SP3 A 10,22 G1
如果我对每一行都做同样的事情,我应该得到预期的输出:
SO if I do the same for each row I should get an expected output as :
species family Events groups
SP1,SP2,SP3 A 10,22 G1
SP1,SP3 B,E 7 G2
SP1,SP2,SP3 C,D 1,3,4,5,6 G3,G4,G6,G5
SP4 C 7,22 G12
请注意,由于SP4的组不存在于其他任何行中,因此尚未与任何行合并.
Note that SP4 has not been merged with any rows since its group was not present in any other rows.
请问有人有主意吗?非常感谢您的帮助和时间
Does someone have an idea please ? Thank you very much for your help and time
这里是数据框,如果有帮助的话:
Here is the dataframe if it can helps:
structure(list(species = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 4L), .Label = c("SP1", "SP2", "SP3", "SP4"), class = "factor"),
family = structure(c(1L, 2L, 4L, 1L, 5L, 3L, 6L, 1L, 3L), .Label = c("A",
"B", "C", "C,D", "D,C", "E"), class = "factor"), Events = structure(c(2L,
7L, 5L, 3L, 6L, 4L, 7L, 1L, 8L), .Label = c("10", "10,22",
"22,10", "4,5,3,6,1", "4,5,6,1,3", "6,5,4,3,1", "7", "7,22"
), class = "factor"), groups = structure(c(1L, 3L, 4L, 1L,
6L, 5L, 3L, 1L, 2L), .Label = c("G1", "G12", "G2", "G3,G4,G5,G6",
"G3,G6,G5", "G4,G6,G5,G3"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
我可以做什么并尝试过:
到目前为止,我只知道如何使用dplyr中的类似内容将具有精确重复值的行合并:
So far I only know how to merge rows with exact duplicated value using something like that in dplyr :
desired_df <- df %>%
group_by_at(vars(-species)) %>%
summarize(species = toString(species)) %>%
ungroup() %>%
select(names(df))
但是这里我们没有确切的重复值,相反,我正在寻找可以在另一行中出现的逗号
之间.
but here we do not have exact duplicated values, instead I'm looking for between comma
that can be present in another row.
推荐答案
这是所有方法的解决方案(将输入数据框称为 dat
).
Here is an all-tidyverse solution (calling the input data frame dat
).
请注意,该解决方案与您提供的期望输出不相同..这是因为您声明的规则是合并每一列至少有一个重复元素的行",但物种除外."根据该规则,第2行和第7行不应合并,因为它们没有共同的 family
.
Please note that this solution is not identical to the desired output you gave. This is because you stated the rule is to "merge rows for which there is at least one duplicated element for each column, except species." By that rule, rows 2 and 7 should not be merged because they share no family
in common.
首先,将要测试重叠值的三列转换为列表列.现在,这些列的每个元素都是一个列表.我还将 Events
列强制为数字,以便其可以正确排序.
First, convert the three columns we are going to test for overlapping values into list-columns. Now each element of those columns is a list. I also coerced the Events
column to a numeric so that it will sort properly.
library(tidyverse)
dat <- dat %>%
mutate(across(c(family, Events, groups), ~ strsplit(as.character(.), split = ','))) %>%
mutate(Events = map(Events, as.numeric))
接下来,定义一个函数来折叠数据帧的每一行.该函数采用参数 i
,它是行索引.在该函数中,我们做两件事:
Next, define a function to collapse each row of the data frame. The function takes argument i
which is a row index. Within the function, we do two things:
- 首先,我们使用
pmap_lgl
遍历数据帧的每一行,以检查所有三列family
,Events
,和groups
与行i
至少具有一个共享值,因此应折叠.例如,如果i == 1
,则将为第1,4和8行返回TRUE
. - 接下来,我们仅对返回
TRUE
的行过滤dat
,并将函数应用于这些行的所有列.该函数将这些行中的所有列折叠为以逗号分隔的具有唯一值的字符串.
- First we use
pmap_lgl
to iterate across each row of the data frame to check for which rows all the three columnsfamily
,Events
, andgroups
have at least one shared value with rowi
and therefore should be collapsed. For example, ifi==1
this will returnTRUE
for rows 1,4,and 8. - Next, we filter
dat
for only those rows that returnedTRUE
, and apply a function to all columns of those rows. The function collapses all columns in those rows into comma-separated strings of the sorted unique values.
collapse_rows <- function(i) {
rows_collapse <- pmap_lgl(dat, function(family, Events, groups, ...)
any(dat$family[[i]] %in% family) & any(dat$Events[[i]] %in% Events) & any(dat$groups[[i]] %in% groups))
dat %>%
filter(rows_collapse) %>%
mutate(across(everything(), ~ paste(sort(unique(unlist(.))), collapse = ',')))
}
最后,我们将此功能应用于每个行索引.我们最终得到重复的行,例如,初始输出的行1、4和8将相同.我们使用 distinct
删除所有这些重复项.
Finally we apply this function to each row index. We end up with duplicated rows, for example rows 1, 4, and 8 of the initial output will be identical. We use distinct
to remove all of those duplicates.
dat_collapse <- map_dfr(1:nrow(dat), collapse_rows) %>% distinct
最终输出:
species family Events groups
1 SP1,SP2,SP3 A 10,22 G1
2 SP1 B 7 G2
3 SP1,SP2,SP3 C,D 1,3,4,5,6 G3,G4,G5,G6
4 SP3 E 7 G2
5 SP4 C 7,22 G12
这篇关于当在另一行中发现元素逗号分隔时合并行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!