Split values of different lengths and bind to columns


Question

I've got a rather large (around 100k observations) data set, similar to this:

data <- data.frame(
                 ID = seq(1, 5, 1),
                 Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"), 
                 stringsAsFactors = FALSE)
data
  ID        Values
1  1         1,2,3
2  2             4
3  3              
4  4 4,1,6,5,1,1,6
5  5           0,0

I want to split the Values column by "," and fill the missing cells with NA:

ID v1 v2 v3 v4 v5 v6 v7
1  1  2  3  NA NA NA NA
2  4  NA NA NA NA NA NA
3  NA NA NA NA NA NA NA
4  4  1  6  5  1  1  6
5  0  0  NA NA NA NA NA
...

My best attempt uses strsplit + rbind:

df <- data.frame(do.call(
                        "rbind",
                        strsplit(as.character(data$Values), split = "," , fixed = FALSE)
                        ))

But rbind just recycles the values of every 'short' row instead of filling it with NA. I have found a similar problem.
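
The recycling is easy to reproduce on a small input. A minimal sketch in base R (the names `parts` and `m` are illustrative only):

```r
# Three comma-separated strings of different lengths
parts <- strsplit(c("1,2,3", "4", "0,0"), ",", fixed = TRUE)

# rbind() pads short rows by recycling their values, not with NA;
# it warns when a row length does not divide the column count
m <- suppressWarnings(do.call(rbind, parts))

m[2, ]  # "4" repeated three times
m[3, ]  # "0" "0" "0": the third cell is a recycled value
```

No cell in `m` is NA, which is why the result cannot be told apart from genuine data.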

Many thanks, Leo

Recommended answer

I would suggest looking at my cSplit function or approaching the problem manually.

The cSplit approach is simply:

library(splitstackshape)  # provides cSplit()
cSplit(data, "Values", ",")
#    ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1:  1        1        2        3       NA       NA       NA       NA
# 2:  2        4       NA       NA       NA       NA       NA       NA
# 3:  3                NA       NA       NA       NA       NA       NA
# 4:  4        4        1        6        5        1        1        6
# 5:  5        0        0       NA       NA       NA       NA       NA
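
Row 3 above shows a blank rather than NA, presumably because its input is a single space, which is still a non-empty string. If blanks should become NA, as in the desired output in the question, one option is to clean the column before splitting. A sketch in base R; the name `cleaned` is illustrative, and this step is an assumption about the desired result, not part of cSplit:

```r
data <- data.frame(
  ID = 1:5,
  Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace, then mark empty strings as missing
cleaned <- trimws(data$Values)
cleaned[cleaned == ""] <- NA_character_

# strsplit() passes a missing input through as a single NA
pieces <- strsplit(cleaned, ",", fixed = TRUE)
pieces[[3]]
```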

The manual approach looks like this:

library(data.table)  # needed for data.table() in the last step
## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol), 
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol), 
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
data.table(ID = data$ID, M)

That's pretty much what goes on in cSplit, but the function has a few other options, some basic error checking and so on, which might make it a little slower than a purely manual approach (or a function written to address your specific problem).
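
The matrix-indexing step is the least obvious part of the manual approach, so here is a small base R sketch of it in isolation. The names `rows` and `cols` are introduced only for illustration, and `lengths()` stands in for the `vapply(Split, length, 1L)` call:

```r
# Already-split values for three rows; the last row is empty
Split <- list(c("1", "2", "3"), "4", character(0))
Ncol <- lengths(Split)                # 3 1 0

# One (row, column) pair per value
rows <- rep(seq_along(Split), Ncol)   # 1 1 1 2
cols <- sequence(Ncol)                # 1 2 3 1

# Start from an all-NA matrix and fill only the addressed cells
M <- matrix(NA_character_, nrow = length(Split), ncol = max(Ncol))
M[cbind(rows, cols)] <- unlist(Split, use.names = FALSE)
M
```

Indexing a matrix with a two-column matrix addresses individual cells, so every cell that has no value keeps its initial NA.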

Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values: your output should have the same number of rows as your input.

I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).

Here's the sample data:

set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x) 
  paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)

These are the functions to test:

fun1a <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun1b <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE), 
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun2 <- function(inDT) {
  cSplit(inDT, "Values", ",")
}

fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol), 
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol), 
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}


Here are the results:

library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
#        expr      min       lq   median       uq      max neval
#    fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339    20
#  fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568    20

## Didn't want to microbenchmark here...
system.time(fun1a(DT))
#    user  system elapsed 
#   16.92    0.50   17.59
system.time(fun1b(DT))  # fixed = TRUE & use.names = FALSE
#    user  system elapsed 
#   11.54    0.42   12.01

NOTE: The results of fun1a and fun1b would not be the same as those of fun2 and fun3 because of the duplicated IDs.
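
The effect of duplicated IDs can be seen on a three-row example. This is a base R sketch, with `split()` plus `lapply()` standing in for the `by = ID` grouping used in fun1a and fun1b (the names `dup`, `by_id` and `by_row` are illustrative):

```r
dup <- data.frame(
  ID = c(1, 2, 1),
  Values = c("1,2", "3", "4,5,6"),
  stringsAsFactors = FALSE
)

# Grouping by ID pools the values of both ID-1 rows into one group
by_id <- lapply(split(dup$Values, dup$ID), function(v) {
  unlist(strsplit(v, ",", fixed = TRUE))
})
length(by_id)   # one entry per distinct ID
by_id[["1"]]    # values from input rows 1 and 3 combined

# Splitting row by row keeps one entry per input row
by_row <- strsplit(dup$Values, ",", fixed = TRUE)
length(by_row)
```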
