Row-wise manipulation on large dataset
Question
I am looking for a faster way to achieve the operation below. The dataset contains > 1M rows but I have provided a simplified example to illustrate the task --
To create the data table --

```r
dt <- data.table(name=c("john","jill"), a1=c(1,4), a2=c(2,5), a3=c(3,6),
                 b1=c(10,40), b2=c(20,50), b3=c(30,60))
colGroups <- c("a","b")  # Columns starting in "a", and in "b"
```

Original Dataset
-----------------------------------

    name a1 a2 a3 b1 b2 b3
    john  1  2  3 10 20 30
    jill  4  5  6 40 50 60
The above dataset is transformed such that 2 new rows are added for each unique name, and in each new row the values are left-shifted for each group of columns independently (in this example I have used the "a" columns and the "b" columns, but there are many more).
Transformed Dataset
-----------------------------------

    name a1 a2 a3 b1 b2 b3
    john  1  2  3 10 20 30   # First row for John
    john  2  3  0 20 30  0   # "a" values left-shifted, "b" values left-shifted
    john  3  0  0 30  0  0   # Same as above, left-shifted again
    jill  4  5  6 40 50 60   # Repeated for Jill
    jill  5  6  0 50 60  0
    jill  6  0  0 60  0  0
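The shift-and-pad rule can be stated compactly: for each group of columns, row k+1 drops the first k values of the group and pads with k zeros. A minimal Python sketch of that rule (the helper name `left_shifts` is mine, not from the question):

```python
def left_shifts(values, n_shifts):
    """Return the original sequence plus n_shifts left-shifted, zero-padded copies."""
    out = [list(values)]
    for k in range(1, n_shifts + 1):
        # drop the first k values, pad the tail with k zeros
        out.append(list(values[k:]) + [0] * k)
    return out

# One group of columns for "john": a1..a3 = [1, 2, 3]
rows = left_shifts([1, 2, 3], 2)
# rows -> [[1, 2, 3], [2, 3, 0], [3, 0, 0]]
```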
And so on. My dataset is extremely large, which is why I am trying to see if there is an efficient way to implement this.
Thanks in advance.
Solution

Update: A (much) faster solution would be to play with the indices as follows (takes about 4 seconds on 1e6 rows * 7 columns):
```r
ll <- vector("list", 3)
ll[[1]] <- copy(dt[, -1, with=FALSE])
d_idx <- seq(2, ncol(dt), by=3)
for (j in 1:2) {
  tmp <- vector("list", 2)
  for (i in seq_along(colGroups)) {
    idx <- ((i-1)*3+2):((i*3)+1)
    tmp[[i]] <- cbind(dt[, setdiff(idx, d_idx[i]:(d_idx[i]+j-1)), with=FALSE],
                      data.table(matrix(0, ncol=j)))
  }
  ll[[j+1]] <- do.call(cbind, tmp)
}
ans <- cbind(data.table(name=dt$name), rbindlist(ll))
setkey(ans, name)
```
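The core of the indexing trick is: for shift j, keep all but the first j columns of each group and append j zero columns, then stack the three resulting frames. A plain-Python sketch of the same arithmetic (the names `shift_groups` and `group_slices` are illustrative, not from the answer):

```python
def shift_groups(row, group_slices, j):
    """Drop the first j values of each column group and pad with j zeros,
    mirroring the column-index arithmetic of the data.table loop."""
    out = []
    for start, stop in group_slices:  # half-open column ranges per group
        group = row[start:stop]
        out.extend(group[j:] + [0] * j)
    return out

row = [1, 2, 3, 10, 20, 30]   # a1..a3, b1..b3 for "john"
groups = [(0, 3), (3, 6)]      # "a" columns, "b" columns
stacked = [shift_groups(row, groups, j) for j in range(3)]
# j=0 keeps the row unchanged; j=1 and j=2 are the two shifted copies
```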
First attempt (old): Very interesting problem. I'd approach it using melt.data.table and dcast.data.table (from 1.8.11) as follows:

```r
require(data.table)
require(reshape2)
# melt is S3 generic, calls melt.data.table, returns a data.table (very fast)
ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[,
    grp := rep(colGroups, each=nrow(dt)*3)]
setkey(ans, name, grp)
ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], variable[1:(.N-2)]),
                  value=c(value, value[-1], value[-(1:2)]),
                  id2=rep.int(1:3, 3:1)), list(name, grp)]
# dcast in reshape2 is not yet an S3 generic, have to call by full name
ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]
```
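For comparison, roughly the same melt → shift-within-group → cast pipeline can be written with pandas. This is an illustrative translation of the approach, not code from the answer, and assumes pandas >= 0.24 for `GroupBy.shift(fill_value=...)`:

```python
import pandas as pd

dt = pd.DataFrame({"name": ["john", "jill"],
                   "a1": [1, 4], "a2": [2, 5], "a3": [3, 6],
                   "b1": [10, 40], "b2": [20, 50], "b3": [30, 60]})

long = dt.melt(id_vars="name")            # wide -> long (melt)
long["grp"] = long["variable"].str[0]     # column group: "a" or "b"

pieces = []
for shift in range(3):                    # 0, 1, 2 left shifts
    p = long.copy()
    # shift values up within each (name, group) block, padding with 0
    p["value"] = (p.groupby(["name", "grp"])["value"]
                    .shift(-shift, fill_value=0))
    p["id2"] = shift
    pieces.append(p)

wide = (pd.concat(pieces)                 # long -> wide (cast)
          .pivot_table(index=["name", "id2"], columns="variable",
                       values="value", fill_value=0)
          .reset_index()
          .drop(columns="id2"))
```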
Benchmarking on 1e6 rows with same number of columns:
```r
require(data.table)
require(reshape2)
set.seed(45)
N <- 1e6
dt <- cbind(data.table(name=paste("x", 1:N, sep="")),
            matrix(sample(10, 6*N, TRUE), nrow=N))
setnames(dt, c("name", "a1", "a2", "a3", "b1", "b2", "b3"))
colGroups = c("a", "b")
system.time({
  ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[,
      grp := rep(colGroups, each=nrow(dt)*3)]
  setkey(ans, name, grp)
  ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], variable[1:(.N-2)]),
                    value=c(value, value[-1], value[-(1:2)]),
                    id2=rep.int(1:3, 3:1)), list(name, grp)]
  ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]
})
#   user  system elapsed
# 45.627   2.197  52.051
```