如果数据复杂,如何分离行 [英] how to separate_rows in case of a complex data

查看:35
本文介绍了如果数据复杂,如何分离行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说实话,这个问题最初不是我的.这个问题促使我把它放在一个简化的案例中.

Truly speaking this problem isn't originally mine. This problem prompted me to put it up in a simplified case.

因此,我必须根据当前场景中的分隔符 ; 将输入到单元格(在列中)的数据分成单独的行.这可以使用 tidyr::separate_rows() 轻松完成.近列中的数据已经连接在一起.现在的问题,实际上有两个,是 -

So, I have to separate data entered into cells (in columns) into separate rows on the basis of delimiter which is ; in the present scenario. This can be done easily using tidyr::separate_rows(). The data in nearly columns is already concatenated. Now the problem, actually two, are -

  1. 可能有由 ; 分隔的文本,但可能在闭括号 () 内,例如(text_A;text_B) 是要保持连接而不是分开的.
  2. 每行不同列的单元格之间可能存在奇数级联.在这种情况下,该行将被分成 单独的行,等于最大连接数.对于串联较少的其他每一列,最后一个 text_value 可能只是重复.
  1. There may be text separated by ; but may be inside closed parenthesis () e.g. (text_A;text_B) which is to be left concatenated and not to be separated.
  2. There may be uneven number of concatenations between cells of different columns in each row. In such case the row is to be broken into separate rows equal to maximum number of concatenations. For every other column where fewer concatenations are there, last text_value may be just repeated.

一个可重现的例子如下

input <- data.frame(
  stringsAsFactors = FALSE,
  col_1 = c("A", "B", "C"),
  Col_2 = c("Text_A;Text_B","Text_C","Text_D;(Text_E;Text_F)"),
  Col_3 = c("Text_1", "Text_2;Text_3", "Text_4"),
  Col_4 = c("Text_a;(Text_b;Text_c);(Text_d;Text_dd)","Text_e","Text_f;Text_g")
)

input
  col_1                  Col_2         Col_3                                   Col_4
1     A          Text_A;Text_B        Text_1 Text_a;(Text_b;Text_c);(Text_d;Text_dd)
2     B                 Text_C Text_2;Text_3                                  Text_e
3     C Text_D;(Text_E;Text_F)        Text_4                           Text_f;Text_g

所需的输出如下:

output
#>   col_1           Col_2  Col_3            Col_4
#> 1     A          Text_A Text_1           Text_a
#> 2     A          Text_B Text_1  (Text_b;Text_c)
#> 3     A          Text_B Text_1 (Text_d;Text_dd)
#> 4     B          Text_C Text_2           Text_e
#> 5     B          Text_C Text_3           Text_e
#> 6     C          Text_D Text_4           Text_f
#> 7     C (Text_E;Text_F) Text_4           Text_g

tidyverse 以外的其他答案也是可以接受的.

Answers other than tidyverse are also acceptable.

推荐答案

这是我的方法,仅假设您的列中没有序列 "":>

Here is my approach, assuming only that there is no sequence "<sep>" in your columns:

input %>%
  mutate(across(-col_1,
                ~ str_replace_all(., "\\([^)]*\\)",
                                  \(x) str_replace_all(x, ";", "<sep>")))) %>%
  pmap(\(...) {
    args <- list(...)
    entries <- map(args[-1], ~ first(str_split(., ";")))
    map(entries, \(e) {
      c(e, rep(e[length(e)], do.call(max, map(entries, length)) - length(e)))
    }) %>%
    bind_rows() %>%
    bind_cols(args[1], .)
  }) %>%
  bind_rows() %>%
  mutate(across(-col_1, ~ str_replace_all(., "<sep>", ";")))

返回:

# A tibble: 7 x 4
  col_1 Col_2           Col_3  Col_4
  <chr> <chr>           <chr>  <chr>
1 A     Text_A          Text_1 Text_a
2 A     Text_B          Text_1 (Text_b;Text_c)
3 A     Text_B          Text_1 (Text_d;Text_dd)
4 B     Text_C          Text_2 Text_e
5 B     Text_C          Text_3 Text_e
6 C     Text_D          Text_4 Text_f
7 C     (Text_E;Text_F) Text_4 Text_g

这篇关于如果数据复杂,如何分离行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆