reshape alternating columns in less time and using less memory


Question

How can I do this reshape faster and so that it takes up less memory? My aim is to reshape a dataframe that is 500,000 rows by 500 columns with 4 Gb RAM.

Here's a function that will make some reproducible data:

make_example <- function(ndoc, ntop){
  # doc numbers
  V1 <- seq_len(ndoc)
  # filenames: random 5-character strings
  # (the original `list("vector", size = ndoc)` was a bug; it creates a
  #  2-element list rather than pre-allocating one of length ndoc)
  V2 <- vector("list", ndoc)
  for (i in 1:ndoc){
    V2[i] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
  }
  # topic proportions
  tvals <- data.frame(matrix(runif(ndoc * ntop), ncol = ntop))
  # topic number
  tnumvals <- data.frame(matrix(sample(1:ntop, size = ndoc * ntop, replace = TRUE), ncol = ntop))
  # now make topic props and topic numbers alternating columns (rather slow!)
  alternating <- data.frame(c(matrix(c(tnumvals, tvals), 2, byrow = TRUE)))
  # make colnames for topic number and topic props
  ntopx <- sapply(1:ntop, function(j) paste0("ntop_", j))
  ptopx <- sapply(1:ntop, function(j) paste0("ptop_", j))
  tops <- c(rbind(ntopx, ptopx))
  # make data frame
  dat <- data.frame(V1 = V1,
                    V2 = unlist(V2),
                    alternating)
  names(dat) <- c("docnum", "filename", tops)
  # give df as result
  return(dat)
}

set.seed(007)
dat <- make_example(500000, 500)

Here's my current method (thanks to http://stackoverflow.com/a/8058714/1036500):

library(reshape2)
NTOPICS = (ncol(dat) - 2 )/2
nam <- c('num', 'text', paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))

system.time( dat_l2 <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long', sep = ""))
system.time( dat.final2 <- dcast(dat_l2, dat_l2[,2] ~ dat_l2[,3], value.var = "proportion" ) )

Some timings, just for the reshape since that's the slowest step:

make_example(5000,100) = 82 sec

make_example(50000,200) = 2855 sec (crashed on attempting the second step)

make_example(500000,500) = not yet possible...

What other methods are there that are faster and less memory intensive for this reshape (data.table, this)?

Answer

I doubt very much that this will succeed with that small an amount of RAM when passing a 500,000 x 500 dataframe. I wonder whether you could do even simple actions in that limited space. Buy more RAM. Furthermore, reshape2 is slow, so use stats::reshape for big stuff, and give it hints about what the separator is.

> set.seed(007)
> dat <- make_example(5, 3)
> dat
  docnum filename ntop_1     ptop_1 ntop_2    ptop_2 ntop_3    ptop_3
1      1    y8214      3 0.06564574      1 0.6799935      2 0.8470244
2      2    e6x39      2 0.62703876      1 0.2637199      3 0.4980761
3      3    34c19      3 0.49047504      3 0.1857143      3 0.7905856
4      4    1H0y6      2 0.97102441      3 0.1851432      2 0.8384639
5      5    P6zqy      3 0.36222085      3 0.3792967      3 0.4569039

> reshape(dat, direction="long", varying=3:8, sep="_")
    docnum filename time ntop       ptop id
1.1      1    y8214    1    3 0.06564574  1
2.1      2    e6x39    1    2 0.62703876  2
3.1      3    34c19    1    3 0.49047504  3
4.1      4    1H0y6    1    2 0.97102441  4
5.1      5    P6zqy    1    3 0.36222085  5
1.2      1    y8214    2    1 0.67999346  1
2.2      2    e6x39    2    1 0.26371993  2
3.2      3    34c19    2    3 0.18571426  3
4.2      4    1H0y6    2    3 0.18514322  4
5.2      5    P6zqy    2    3 0.37929675  5
1.3      1    y8214    3    2 0.84702439  1
2.3      2    e6x39    3    3 0.49807613  2
3.3      3    34c19    3    3 0.79058557  3
4.3      4    1H0y6    3    2 0.83846387  4
5.3      5    P6zqy    3    3 0.45690386  5

> system.time( dat <- make_example(5000,100) )
   user  system elapsed 
  2.925   0.131   3.043 
> system.time( dat2 <-  reshape(dat, direction="long", varying=3:202, sep="_"))
   user  system elapsed 
 16.766   8.608  25.272 

I'd say that around 1/5 of 32 GB of memory got used during that process, on a problem 250 times smaller than your goal, so I'm not surprised that your machine hung. (It should not have "crashed". The authors of R would prefer that you give accurate descriptions of behavior, and I suspect the R process simply stopped responding when it paged into virtual memory.) I have performance issues that I need to work around with a dataset that is 7 million records x 100 columns when using 32 GB.
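The question also mentions data.table. As a hedged sketch (not part of the original answer), `data.table::melt()` can reshape both column groups in a single pass via `patterns()`, which is generally faster and more memory-frugal than `stats::reshape` for wide data; the names given to `patterns()` become the value-column names. On a small toy table mimicking the layout produced by `make_example()`:

```r
library(data.table)

# toy wide table with alternating topic-number / proportion columns,
# mimicking the structure produced by make_example() above
DT <- data.table(docnum = 1:3, filename = c("a", "b", "c"),
                 ntop_1 = c(2L, 1L, 3L), ptop_1 = c(0.1, 0.5, 0.9),
                 ntop_2 = c(1L, 3L, 2L), ptop_2 = c(0.4, 0.2, 0.7))

# melt both column groups in one pass; the names given to patterns()
# ("ntop", "ptop") become the value-column names in the long result
long <- melt(DT,
             id.vars       = c("docnum", "filename"),
             measure.vars  = patterns(ntop = "^ntop_", ptop = "^ptop_"),
             variable.name = "topic")

# second step, analogous to the dcast() in the question:
# filename ~ topic number, cells filled with the proportions
wide <- dcast(long, filename ~ ntop, value.var = "ptop", fun.aggregate = mean)
```

`fun.aggregate` is needed here because, as in the original data, the same topic number can occur more than once per filename.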
