Row-wise manipulation on large dataset
Question
I am looking for a faster way to achieve the operation below. The dataset contains > 1M rows but I have provided a simplified example to illustrate the task --
To create the data table --

```r
dt <- data.table(name=c("john","jill"), a1=c(1,4), a2=c(2,5), a3=c(3,6),
                 b1=c(10,40), b2=c(20,50), b3=c(30,60))
colGroups <- c("a","b")  # Columns starting in "a", and in "b"
```

Original Dataset
-----------------------------------

    name a1 a2 a3 b1 b2 b3
    john  1  2  3 10 20 30
    jill  4  5  6 40 50 60
The above dataset is transformed such that 2 new rows are added for each unique name, and in each new row the values are left-shifted for each group of columns independently (in this example I have used the "a" columns and the "b" columns, but there are many more).
Transformed Dataset
-----------------------------------

    name a1 a2 a3 b1 b2 b3
    john  1  2  3 10 20 30   # First row for John
    john  2  3  0 20 30  0   # "a" values left-shifted, "b" values left-shifted
    john  3  0  0 30  0  0   # Same as above, left-shifted again
    jill  4  5  6 40 50 60   # Repeated for Jill
    jill  5  6  0 50 60  0
    jill  6  0  0 60  0  0
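The shift-and-pad rule can be stated compactly: for each group of columns, row k+1 drops the first k values of the group and pads with k zeros. A minimal Python sketch of that rule (the helper name `left_shifts` is mine, not from the question):

```python
def left_shifts(values, n_shifts):
    """Return the original sequence plus n_shifts left-shifted, zero-padded copies."""
    out = [list(values)]
    for k in range(1, n_shifts + 1):
        # drop the first k values, pad the tail with k zeros
        out.append(list(values[k:]) + [0] * k)
    return out

# One group of columns for "john": a1..a3 = [1, 2, 3]
rows = left_shifts([1, 2, 3], 2)
# rows -> [[1, 2, 3], [2, 3, 0], [3, 0, 0]]
```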
And so on. My dataset is extremely large, which is why I am trying to see if there is an efficient way to implement this.
Thanks in advance.
Solution

Update: A (much) faster solution would be to play with the indices as follows (takes about 4 seconds on 1e6 rows * 7 columns):
```r
ll <- vector("list", 3)
ll[[1]] <- copy(dt[, -1, with=FALSE])
d_idx <- seq(2, ncol(dt), by=3)
for (j in 1:2) {
  tmp <- vector("list", 2)
  for (i in seq_along(colGroups)) {
    idx <- ((i-1)*3+2):((i*3)+1)
    tmp[[i]] <- cbind(dt[, setdiff(idx, d_idx[i]:(d_idx[i]+j-1)), with=FALSE],
                      data.table(matrix(0, ncol=j)))
  }
  ll[[j+1]] <- do.call(cbind, tmp)
}
ans <- cbind(data.table(name=dt$name), rbindlist(ll))
setkey(ans, name)
```
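The core of the indexing trick is: for shift j, keep all but the first j columns of each group and append j zero columns, then stack the three resulting frames. A plain-Python sketch of the same arithmetic (the names `shift_groups` and `group_slices` are illustrative, not from the answer):

```python
def shift_groups(row, group_slices, j):
    """Drop the first j values of each column group and pad with j zeros,
    mirroring the column-index arithmetic of the data.table loop."""
    out = []
    for start, stop in group_slices:  # half-open column ranges per group
        group = row[start:stop]
        out.extend(group[j:] + [0] * j)
    return out

row = [1, 2, 3, 10, 20, 30]   # a1..a3, b1..b3 for "john"
groups = [(0, 3), (3, 6)]      # "a" columns, "b" columns
stacked = [shift_groups(row, groups, j) for j in range(3)]
# j=0 keeps the row unchanged; j=1 and j=2 are the two shifted copies
```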
First attempt (old): Very interesting problem. I'd approach it using melt.data.table and dcast.data.table (from 1.8.11) as follows:

```r
require(data.table)
require(reshape2)
# melt is S3 generic, calls melt.data.table, returns a data.table (very fast)
ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[,
    grp := rep(colGroups, each=nrow(dt)*3)]
setkey(ans, name, grp)
ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], variable[1:(.N-2)]),
                  value=c(value, value[-1], value[-(1:2)]),
                  id2=rep.int(1:3, 3:1)), list(name, grp)]
# dcast in reshape2 is not yet an S3 generic, have to call by full name
ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]
```
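For comparison, roughly the same melt → shift-within-group → cast pipeline can be written with pandas. This is an illustrative translation of the approach, not code from the answer, and assumes pandas >= 0.24 for `GroupBy.shift(fill_value=...)`:

```python
import pandas as pd

dt = pd.DataFrame({"name": ["john", "jill"],
                   "a1": [1, 4], "a2": [2, 5], "a3": [3, 6],
                   "b1": [10, 40], "b2": [20, 50], "b3": [30, 60]})

long = dt.melt(id_vars="name")            # wide -> long (melt)
long["grp"] = long["variable"].str[0]     # column group: "a" or "b"

pieces = []
for shift in range(3):                    # 0, 1, 2 left shifts
    p = long.copy()
    # shift values up within each (name, group) block, padding with 0
    p["value"] = (p.groupby(["name", "grp"])["value"]
                    .shift(-shift, fill_value=0))
    p["id2"] = shift
    pieces.append(p)

wide = (pd.concat(pieces)                 # long -> wide (cast)
          .pivot_table(index=["name", "id2"], columns="variable",
                       values="value", fill_value=0)
          .reset_index()
          .drop(columns="id2"))
```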
Benchmarking on 1e6 rows with same number of columns:
```r
require(data.table)
require(reshape2)
set.seed(45)
N <- 1e6
dt <- cbind(data.table(name=paste("x", 1:N, sep="")),
            matrix(sample(10, 6*N, TRUE), nrow=N))
setnames(dt, c("name", "a1", "a2", "a3", "b1", "b2", "b3"))
colGroups = c("a", "b")
system.time({
  ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[,
      grp := rep(colGroups, each=nrow(dt)*3)]
  setkey(ans, name, grp)
  ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], variable[1:(.N-2)]),
                    value=c(value, value[-1], value[-(1:2)]),
                    id2=rep.int(1:3, 3:1)), list(name, grp)]
  ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]
})
#   user  system elapsed
# 45.627   2.197  52.051
```