将一列中的逗号分隔的字符串拆分为单独的行 [英] Split comma-separated strings in a column into separate rows

查看:150
本文介绍了将一列中的逗号分隔的字符串拆分为单独的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框,就像这样:

I have a data frame, like so:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

如您所见,director列中的某些条目是多个名称,以逗号分隔.我想将这些条目拆分为单独的行,同时保持另一列的值.例如,应该将上面数据框中的第一行分为两行,在director列中分别使用一个名称,在AB列中使用"A".

As you can see, some entries in the director column are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the director column and 'A' in the AB column.

推荐答案

这个老问题经常被用作欺骗对象(标记为r-faq).到目前为止,已经提供了6种不同的方法进行了三遍回答,但是缺乏基准作为指导,哪种方法是最快的 1 .

This old question frequently is being used as dupe target (tagged with r-faq). As of today, it has been answered three times offering 6 different approaches but is lacking a benchmark as guidance which of the approaches is the fastest1.

基准解决方案包括

  • Matthew Lundberg's base R approach but modified according to Rich Scriven's comment,
  • Jaap's two data.table methods and two dplyr / tidyr approaches,
  • Ananda's splitstackshapesolution,
  • and two additional variants of Jaap's data.table methods.

使用microbenchmark包(参见下面的代码),对8种不同的方法在6种不同大小的数据帧上进行了基准测试.

Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

OP提供的样本数据仅包含20行.为了创建更大的数据帧,将这20行简单地重复1、10、100、1000、10000和100000次,这样问题大小最多可达200​​万行.

The sample data given by the OP consists only of 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times which give problem sizes of up to 2 million rows.

基准结果表明,对于足够大的数据帧,所有data.table方法都比其他任何方法都快.对于具有约5000行以上的数据帧,Jaap的data.table方法2和变体DT3是最快的,幅度比最慢的方法快.

The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, magnitudes faster than the slowest methods.

值得注意的是,两种tidyverse方法和splistackshape解决方案的时间安排是如此相似,以至于很难消除图表中的曲线.在所有数据帧大小中,它们是基准测试方法中最慢的.

Remarkably, the timings of the two tidyverse methods and the splistackshape solution are so similar that it's difficult to distiguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

对于较小的数据帧,Matt的base R解决方案和data.table方法4似乎比其他方法具有更少的开销.

For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

为问题大小为n

的基准运行定义函数

Define function for benchmark runs of problem size n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

针对不同问题规模运行基准测试

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

准备绘图数据

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

创建图表

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

会话信息&软件包版本(摘录)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)


1 一个问题tidyverse答案,该问题已作为与此副本的副本而关闭问题.


1My curiosity was piqued by this exuberant comment Brilliant! Orders of magnitude faster! to a tidyverse answer of a question which was closed as a duplicate of this question.

这篇关于将一列中的逗号分隔的字符串拆分为单独的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆