如果数据复杂,如何分离行 [英] how to separate_rows in case of a complex data
问题描述
说实话,这个问题最初不是我的.这个问题促使我把它放在一个简化的案例中.
Truly speaking this problem isn't originally mine. This problem prompted me to put it up in a simplified case.
因此,我必须根据当前场景中的分隔符 ;
将输入到单元格(在列中)的数据分成单独的行.这可以使用 tidyr::separate_rows()
轻松完成.近列中的数据已经连接在一起.现在的问题,实际上有两个,是 -
So, I have to separate data entered into cells (in columns) into separate rows on the basis of delimiter which is ;
in the present scenario. This can be done easily using tidyr::separate_rows()
. The data in nearly columns is already concatenated. Now the problem, actually two, are -
- 可能有由
;
分隔的文本,但可能在闭括号()
内,例如(text_A;text_B)
是要保持连接而不是分开的. - 每行不同列的单元格之间可能存在奇数级联.在这种情况下,该行将被分成
单独的行
,等于最大连接数.对于串联较少的其他每一列,最后一个 text_value 可能只是重复.
- There may be text separated by
;
but may be inside closed parenthesis()
e.g.(text_A;text_B)
which is to be left concatenated and not to be separated. - There may be uneven number of concatenations between cells of different columns in each row. In such case the row is to be broken into
separate rows
equal to maximum number of concatenations. For every other column where fewer concatenations are there, last text_value may be just repeated.
一个可重现的例子如下
input <- data.frame(
stringsAsFactors = FALSE,
col_1 = c("A", "B", "C"),
Col_2 = c("Text_A;Text_B","Text_C","Text_D;(Text_E;Text_F)"),
Col_3 = c("Text_1", "Text_2;Text_3", "Text_4"),
Col_4 = c("Text_a;(Text_b;Text_c);(Text_d;Text_dd)","Text_e","Text_f;Text_g")
)
input
col_1 Col_2 Col_3 Col_4
1 A Text_A;Text_B Text_1 Text_a;(Text_b;Text_c);(Text_d;Text_dd)
2 B Text_C Text_2;Text_3 Text_e
3 C Text_D;(Text_E;Text_F) Text_4 Text_f;Text_g
所需的输出如下:
output
#> col_1 Col_2 Col_3 Col_4
#> 1 A Text_A Text_1 Text_a
#> 2 A Text_B Text_1 (Text_b;Text_c)
#> 3 A Text_B Text_1 (Text_d;Text_dd)
#> 4 B Text_C Text_2 Text_e
#> 5 B Text_C Text_3 Text_e
#> 6 C Text_D Text_4 Text_f
#> 7 C (Text_E;Text_F) Text_4 Text_g
除tidyverse
以外的其他答案也是可以接受的.
Answers other than tidyverse
are also acceptable.
推荐答案
这是我的方法,仅假设您的列中没有序列 "
:>
Here is my approach, assuming only that there is no sequence "<sep>"
in your columns:
input %>%
mutate(across(-col_1,
~ str_replace_all(., "\\([^)]*\\)",
\(x) str_replace_all(x, ";", "<sep>")))) %>%
pmap(\(...) {
args <- list(...)
entries <- map(args[-1], ~ first(str_split(., ";")))
map(entries, \(e) {
c(e, rep(e[length(e)], do.call(max, map(entries, length)) - length(e)))
}) %>%
bind_rows() %>%
bind_cols(args[1], .)
}) %>%
bind_rows() %>%
mutate(across(-col_1, ~ str_replace_all(., "<sep>", ";")))
返回:
# A tibble: 7 x 4
col_1 Col_2 Col_3 Col_4
<chr> <chr> <chr> <chr>
1 A Text_A Text_1 Text_a
2 A Text_B Text_1 (Text_b;Text_c)
3 A Text_B Text_1 (Text_d;Text_dd)
4 B Text_C Text_2 Text_e
5 B Text_C Text_3 Text_e
6 C Text_D Text_4 Text_f
7 C (Text_E;Text_F) Text_4 Text_g
这篇关于如果数据复杂,如何分离行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!