Split comma-separated strings in a column into separate rows


Problem description

    I have a data frame, like so:

    data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                            "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                            "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                            "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                            "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                            "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                            "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))
    

    As you can see, some entries in the director column are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the director column and 'A' in the AB column.

    Solution

    This old question is frequently used as a dupe target (tagged with r-faq). As of today, it has been answered three times, offering 6 different approaches, but it lacks a benchmark as guidance on which of the approaches is the fastest [1].

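    The benchmarked solutions include Matt's base R solution (matt_mod in the code below), Jaap's data.table methods 1 and 2 (jaap_DT1, jaap_DT2), the two tidyverse methods (jaap_dplyr, jaap_tidyr), the splitstackshape solution (cSplit), and two additional data.table variants (DT3, DT4).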

    Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

    The sample data given by the OP consists of only 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times, which gives problem sizes of up to 2 million rows.
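The scaling step can be sketched in base R as follows. This is only an illustration: it uses a two-row stand-in for the OP's 20-row sample and an arbitrary repetition factor `n = 3`, both chosen here for brevity.

```r
# Two-row stand-in for the OP's 20-row sample (values taken from the question)
director <- c("Aaron Blaise,Bob Walker", "Akira Kurosawa")
AB <- c("A", "B")

# Repeat the sample n times, as the benchmark does for each problem size
n <- 3L  # illustrative value, not one of the benchmarked sizes
DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
nrow(DF)

# Row counts of the full 20-row sample for each repetition factor
n_rows <- 20 * 10^(0:5)
n_rows
```

With the full 20-row sample, the largest repetition factor (100000) yields the 2 million rows mentioned above.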

    Benchmark results

    The benchmark results show that, for sufficiently large data frames, all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, orders of magnitude faster than the slowest methods.

    Remarkably, the timings of the two tidyverse methods and the splitstackshape solution are so similar that it's difficult to distinguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

    For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.
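For readers who only need the low-overhead base R approach, here is a minimal sketch on the first two rows of the OP's data. It follows the `matt_mod` expression from the benchmark code; the `fixed = TRUE` and `stringsAsFactors = FALSE` arguments are additions made here for the illustration, not part of the benchmarked version.

```r
# First two rows of the OP's sample data
DF <- data.frame(
  director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa"),
  AB = c("A", "B"),
  stringsAsFactors = FALSE
)

# Split each entry on commas; s is a list with one character vector per row
s <- strsplit(as.character(DF$director), ",", fixed = TRUE)

# Flatten the names and repeat each AB value once per name in its row
result <- data.frame(
  director = unlist(s),
  AB = rep(DF$AB, lengths(s)),
  stringsAsFactors = FALSE
)
print(result)
```

The first input row becomes two rows that both carry 'A', which is the behaviour the question asks for.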

    Code

    director <- 
      c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
        "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
        "Anne Fontaine", "Anthony Harvey")
    AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
            "A", "B", "B", "B", "B", "B", "B", "A")
    
    library(data.table)
    library(magrittr)
    

    Define function for benchmark runs of problem size n

    run_mb <- function(n) {
      # compute number of benchmark runs depending on problem size `n`
      mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
      cat(n, " ", mb_times, "\n")
      # create data
      DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
      DT <- as.data.table(DF)
      # start benchmarks
      microbenchmark::microbenchmark(
        matt_mod = {
          s <- strsplit(as.character(DF$director), ',')
          data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
        jaap_DT1 = {
          DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
             ][!is.na(director)]},
        jaap_DT2 = {
          DT[, strsplit(as.character(director), ",", fixed=TRUE), 
             by = .(AB, director)][,.(director = V1, AB)]},
        jaap_dplyr = {
          DF %>% 
            dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
            tidyr::unnest(director)},
        jaap_tidyr = {
          tidyr::separate_rows(DF, director, sep = ",")},
        cSplit = {
          splitstackshape::cSplit(DF, "director", ",", direction = "long")},
        DT3 = {
          DT[, strsplit(as.character(director), ",", fixed=TRUE),
             by = .(AB, director)][, director := NULL][
               , setnames(.SD, "V1", "director")]},
        DT4 = {
          DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
             by = .(AB)]},
        times = mb_times
      )
    }
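
To see how the number of repetitions per problem size comes out, the clamping done by `scales::squish()` can be imitated in base R. The helper `clamp()` below is a hypothetical stand-in written for this illustration; it is not part of the original code or of the scales package.

```r
# Hypothetical helper: force x into the interval [lo, hi],
# standing in for scales::squish() in the function above
clamp <- function(x, lo, hi) pmin(pmax(x, lo), hi)

# Repetition factors used in the benchmark
n_rep <- 10^(0:5)

# Number of microbenchmark runs per problem size
mb_times <- clamp(10000 / n_rep, 3, 100)
mb_times
```

Small problems are thus timed 100 times each, while the two largest problem sizes are timed only 3 times to keep the total run time manageable.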
    

    Run benchmark for different problem sizes

    # define vector of problem sizes
    n_rep <- 10L^(0:5)
    # run benchmark for different problem sizes
    mb <- lapply(n_rep, run_mb)
    

    Prepare data for plotting

    mbl <- rbindlist(mb, idcol = "N")
    mbl[, n_row := NROW(director) * n_rep[N]]
    mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
    mba[, expr := forcats::fct_reorder(expr, -median_time)]
    

    Create chart

    library(ggplot2)
    ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
      geom_point() + geom_smooth(se = FALSE) + 
      scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
      xlab("number of rows") + ylab("median of execution time [ms]") +
      ggtitle("microbenchmark results") + theme_bw()
    

    Session info & package versions (excerpt)

    devtools::session_info()
    #Session info
    # version  R version 3.3.2 (2016-10-31)
    # system   x86_64, mingw32
    #Packages
    # data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
    # dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
    # forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
    # ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
    # magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
    # microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
    # scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
    # splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
    # tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)
    


    [1] My curiosity was piqued by this exuberant comment, "Brilliant! Orders of magnitude faster!", on a tidyverse answer to a question that was closed as a duplicate of this one.
