Split values of different lengths and bind to columns
Question
I've got a rather large data set (around 100k observations), similar to this:
data <- data.frame(
  ID = seq(1, 5, 1),
  Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
  stringsAsFactors = FALSE)
data
ID Values
1 1 1,2,3
2 2 4
3 3
4 4 4,1,6,5,1,1,6
5 5 0,0
I want to split the Values column by "," and fill the missing cells with NA:
ID v1 v2 v3 v4 v5 v6 v7
1 1 2 3 NA NA NA NA
2 4 NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 4 1 6 5 1 1 6
5 0 0 NA NA NA NA NA
...
My best attempt uses strsplit + rbind:
df <- data.frame(do.call(
"rbind",
strsplit(as.character(data$Values), split = "," , fixed = FALSE)
))
But the rbind function just recycles the values in all the 'short' rows instead of filling them with NA.
I have found a similar problem
Many thanks, Leo
Recommended answer
I would suggest looking at my cSplit function or approaching the problem manually.
The cSplit approach is simply:
library(splitstackshape)  # provides cSplit
cSplit(data, "Values", ",")
# ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1: 1 1 2 3 NA NA NA NA
# 2: 2 4 NA NA NA NA NA NA
# 3: 3 NA NA NA NA NA NA
# 4: 4 4 1 6 5 1 1 6
# 5: 5 0 0 NA NA NA NA NA
The manual approach looks like this:
## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol),
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
library(data.table)
data.table(ID = data$ID, M)
That's pretty much what goes on in cSplit, but the function has a few other options, some basic error checking, and so on, which might make it a little slower than a purely manual approach (or a function written to address your specific problem).
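As a rough illustration of "a function written to address your specific problem", here is a minimal base R sketch. The helper name split_to_matrix and its arguments are hypothetical and not part of any package. It pads each split vector with NA via `length<-` rather than using matrix indexing, and, unlike the manual code above, it assumes that surrounding whitespace should be trimmed so that a cell like " " becomes all NA. It has not been benchmarked against the approaches in this answer.

```r
## Hypothetical helper (not from any package): split a delimited
## character vector into an NA-padded character matrix, base R only.
split_to_matrix <- function(x, sep = ",", prefix = "V") {
  ## Assumption: leading/trailing whitespace is not meaningful
  parts <- strsplit(trimws(x), sep, fixed = TRUE)
  ## Widest row decides the number of columns (at least one)
  width <- max(lengths(parts), 1L)
  ## `length<-` extends each short vector with NA up to `width`
  padded <- lapply(parts, `length<-`, width)
  out <- do.call(rbind, padded)
  colnames(out) <- paste0(prefix, seq_len(width))
  out
}

res <- split_to_matrix(c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"))
dim(res)  # 5 7
```

Because every input element produces exactly one padded vector, the result always has one row per input element, which is the behaviour rbind alone does not give when the pieces differ in length.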
Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values: your output should have the same number of rows as your input.
I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).
Here's the sample data:
set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x)
paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)
These are the functions to test:
## Each function now uses its own argument rather than the global DT
fun1a <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID]
  dcast.data.table(data2, ID ~ Var,
                   fill = NA_character_,
                   value.var = "Values")
}
fun1b <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE),
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID]
  dcast.data.table(data2, ID ~ Var,
                   fill = NA_character_,
                   value.var = "Values")
}
fun2 <- function(inDT) {
  cSplit(inDT, "Values", ",")
}
fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol),
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol),
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}
Here are the results:
library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339 20
# fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568 20
## Didn't want to microbenchmark here...
system.time(fun1a(DT))
# user system elapsed
# 16.92 0.50 17.59
system.time(fun1b(DT)) # fixed = TRUE & use.names = FALSE
# user system elapsed
# 11.54 0.42 12.01
NOTE: The results of fun1a and fun1b would not be the same as those of fun2 and fun3 because of the duplicated IDs.
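To see why duplicated IDs change the result, here is a small base R sketch on made-up toy data. It uses tapply as a stand-in for grouping by ID (an assumption for illustration, not the data.table code benchmarked above): the row-wise split keeps one result per input row, while the grouped split pools the values of rows that share an ID.

```r
## Toy data with a duplicated ID (illustrative values only)
toy <- data.frame(ID = c(1, 2, 1),
                  Values = c("1,2", "3", "4,5,6"),
                  stringsAsFactors = FALSE)

## Row-wise split: one list element per input row
by_row <- strsplit(toy$Values, ",", fixed = TRUE)

## Grouped split: rows sharing an ID are pooled before splitting
by_id <- tapply(toy$Values, toy$ID, function(v)
  unlist(strsplit(v, ",", fixed = TRUE), use.names = FALSE),
  simplify = FALSE)

length(by_row)  # 3
length(by_id)   # 2
by_id[["1"]]    # "1" "2" "4" "5" "6"
```

The grouped version ends up with fewer rows and a wider row for ID 1, so its output cannot match a row-wise result on data with repeated IDs.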