Split column in data.table to multiple rows


Question

I often have tables where a single cell may contain multiple values (divided by some separator character), and I need to split such records into separate rows. For example:

library(data.table)

dt1 <- fread("V1 V2 V3
             x b;c;d 1
             y d;ef  2
             z d;ef  3")

should become:

#    V1 V2 V3
# 1:  x  b  1
# 2:  x  c  1
# 3:  x  d  1
# 4:  y  d  2
# 5:  y ef  2
# 6:  z  d  3
# 7:  z ef  3

So far I have written the following function:

# I omit all error-checking code here and assume that
# dtInput   is a valid data.table and
# col2split is a name of existing column
splitcol2rows <- function(dtInput, col2split, sep){
  ori.names <- names(dtInput); # save original order of columns
  ori.keys  <-   key(dtInput); # save original keys

  # create new table with 2 columns:
  # one is original "un-splitted" column (will be later used as a key)
  # and second one is result of strsplit:
  dt.split <- dtInput[, 
                    .(tmp.add.col=rep(unlist(strsplit(get(col2split), sep, fixed=TRUE)), .N)),
                    by=col2split]
  dt.split <- unique(dt.split, by=NULL);

  # now use that column as a key:
  setkeyv(dt.split, col2split)
  setkeyv(dtInput, col2split)
  dtInput <- dt.split[dtInput, allow.cartesian=TRUE];

  # leave only 'splitted' column
  dtInput[, c(col2split):=NULL]; 
  setnames(dtInput, 'tmp.add.col', col2split); 

  # restore original columns order and keys
  setcolorder(dtInput, ori.names);
  setkeyv(dtInput, ori.keys);

  return(dtInput);
}

It works fine (check the example output with splitcol2rows(dt1, 'V2', ';')[]), but I'm sure this solution is far from optimal, and I would be grateful for any advice. For example, I looked through the solution proposed by Matt in his answer to the question "Applying a function to each row of a data.table", and I like that it manages without creating an intermediate table (my dt.split). In my case, however, I need to keep all the other columns, and I don't see how to do that otherwise.

UPD. First, starting from the solution proposed by @RichardScriven, I rewrote my function so that it became much shorter and easier to read:

splitcol2rows_mget <- function(dtInput, col2split, sep){
  dtInput <- dtInput[, .(tmp.add.col = unlist(strsplit(get(col2split), sep, fixed=TRUE))), by=names(dtInput)]

  dtInput[, c(col2split):=NULL];
  setnames(dtInput, 'tmp.add.col', col2split); 
  return(dtInput);
}

It still has some ugly pieces, such as the intermediate 'tmp.add.col' column, which might cause a conflict if a column with that name already exists in the original table. In addition, this shorter solution turned out to be slower than my first version, and both of them are slower than cSplit() from the splitstackshape package:

require('microbenchmark')
require('splitstackshape')

splitMy1 <- function(input){return(splitcol2rows(input, col2split = 'V2', sep = ';'))}
splitMy2 <- function(input){return(splitcol2rows_mget(input, col2split = 'V2', sep = ';'))}
splitSH  <- function(input){return(cSplit(input, splitCols = 'V2', sep = ';', direction = 'long'))}

# Smaller table, 100 repeats:
set.seed(1)
num.rows <- 1e4;
dt1 <- data.table(V1=seq_len(num.rows),
                  V2=replicate(num.rows,paste0(sample(letters, runif(1,1,6), T), collapse = ";")),
                  V3=rnorm(num.rows))
print(microbenchmark(splitMy1(dt1), splitMy2(dt1), splitSH(dt1), times=100L))
#Unit: milliseconds
#          expr      min       lq     mean   median       uq       max neval
# splitMy1(dt1) 56.34475 58.53842 68.11128 62.51419 79.79727  98.96797   100
# splitMy2(dt1) 61.84215 64.59619 76.41503 69.02970 88.49229 132.43679   100
#  splitSH(dt1) 31.29671 33.14389 38.28108 34.91696 39.31291  83.58625   100    

# Bigger table, 1 repeat:
set.seed(1)
num.rows <- 5e5;
dt1 <- data.table(V1=seq_len(num.rows),
                  V2=replicate(num.rows,paste0(sample(letters, runif(1,1,6), T), collapse = ";")),
                  V3=rnorm(num.rows))
print(microbenchmark(splitMy1(dt1), splitMy2(dt1), splitSH(dt1), times=1L))

#Unit: seconds
#          expr      min       lq     mean   median       uq      max neval
# splitMy1(dt1) 2.955825 2.955825 2.955825 2.955825 2.955825 2.955825     1
# splitMy2(dt1) 3.693612 3.693612 3.693612 3.693612 3.693612 3.693612     1
#  splitSH(dt1) 1.990201 1.990201 1.990201 1.990201 1.990201 1.990201     1

Recommended answer

The splitstackshape package has a function called cSplit that is well suited to this task. Simply pass ";" as the separator and "long" as the direction to get the result we need.

> library(splitstackshape)
> dat <- data.frame(V1 = c("x", "y", "z"), V2 = c("b;c;d", "d;ef", "d;ef"), V3 = 1:3, stringsAsFactors = FALSE)
> cSplit(dat, "V2", sep = ";", direction = "long")
#   V1 V2 V3
# 1:  x  b  1
# 2:  x  c  1
# 3:  x  d  1
# 4:  y  d  2
# 5:  y ef  2
# 6:  z  d  3
# 7:  z ef  3
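
For readers who would rather not add a dependency, the same reshaping can be sketched with data.table and base R alone. This is only an illustration, not part of the answer above: the helper name split_col_to_rows is hypothetical, and it assumes the column being split is of type character (a factor would need as.character() first). It avoids both the intermediate table and the temporary 'tmp.add.col' column mentioned in the question, and it makes no claim about speed relative to cSplit.

```r
library(data.table)

# Hypothetical helper; not part of data.table or splitstackshape.
# Assumes dt[[col]] is a character column.
split_col_to_rows <- function(dt, col, sep) {
  # Split every cell of `col` on the literal (non-regex) separator.
  parts <- strsplit(dt[[col]], sep, fixed = TRUE)
  # Repeat each row once per piece. Subsetting by row index returns a
  # new table, so the caller's `dt` is not modified.
  out <- dt[rep(seq_len(nrow(dt)), lengths(parts))]
  # Overwrite the repeated cells with the individual pieces, by reference.
  set(out, j = col, value = unlist(parts))
  out[]
}

dt1 <- data.table(V1 = c("x", "y", "z"),
                  V2 = c("b;c;d", "d;ef", "d;ef"),
                  V3 = 1:3)
res <- split_col_to_rows(dt1, "V2", ";")
print(res)
```

Because the rows are expanded with rep() and lengths(), all other columns and the original column order are carried along automatically, so no renaming or setcolorder() step is needed.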
