Reduce computing time for reshape


Problem description

I have the following dataset, which I would like to reshape from wide to long format:

Name     Code  CURRENCY   01/01/1980   02/01/1980   03/01/1980   04/01/1980
Abengoa  4256  USD        1.53         1.54         1.51         1.52      
Adidas   6783  USD        0.23         0.54         0.61         0.62      

The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:

library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")

However, for a .csv file of about 50 MB, it already takes about two hours. The computing time shouldn't be driven by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16 GB of RAM. Is there any other more efficient way to do this?

Many thanks!

Recommended answer

While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:

data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))

In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
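As a toy illustration of that recycling (the `d1`/`d2` column names below are made-up stand-ins for the date columns), `stack()` emits one row per (row, data column) pair, and `data.frame()` then recycles the short id columns against the stacked rows:

```r
# Toy wide data frame; "d1"/"d2" are hypothetical stand-ins for the date columns
wide <- data.frame(Name = c("Abengoa", "Adidas"),
                   d1 = c(1.53, 0.23),
                   d2 = c(1.54, 0.54))

# stack() returns a two-column data frame: "values" (the stacked numbers,
# column by column) and "ind" (which column each value came from).
# data.frame() recycles the 2-row Name column against the 4 stacked rows.
long <- data.frame(wide[1], stack(wide[-1]))
long
#      Name values ind
# 1 Abengoa   1.53  d1
# 2  Adidas   0.23  d1
# 3 Abengoa   1.54  d2
# 4  Adidas   0.54  d2
```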

For that matter, this might also be worth considering:

data.frame(data[1:3],
           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
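The reason this lines up (a sketch with hypothetical column names): `as.matrix()` flattens column-major, so all values of the first date column come out first, which is exactly the order that `rep(names, each = nrow(data))` produces for the labels:

```r
# Toy check that column-major flattening matches rep(..., each = nrow)
wide <- data.frame(Name = c("A", "B"), d1 = c(1, 2), d2 = c(3, 4))

long <- data.frame(wide[1],
                   # c(1, 2, 3, 4): column d1 first, then column d2
                   vals = as.vector(as.matrix(wide[-1])),
                   # "d1", "d1", "d2", "d2": each name repeated once per row
                   date = rep(names(wide)[-1], each = nrow(wide)))
```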

I'm cautious about benchmarking on such a small sample of data, though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.

Using @RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation. You can see the results of the benchmarks here, on datasets ranging from 1000 rows to 3000 rows, in 500 row increments, and all with 8003 columns (8000 data columns, plus the three initial columns).

The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time

Ricardo's correct--there seems to be something about 3000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has any explanation about what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.

Here are the results for just the last three runs:

# makeSomeData() and the "DT"/"Manual" expressions are defined in the
# benchmarking script linked above; benchmark() is from the "rbenchmark" package
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1, 
    columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), 
    order = "relative"))
##   relative   test elapsed user.self sys.self replications
## 2    1.000 Manual   0.908     0.696    0.108            1
## 1    3.963     DT   3.598     3.564    0.012            1

rm(data, dateCols, nvc, dtt)

data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1, 
    columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), 
    order = "relative"))
##   relative   test elapsed user.self sys.self replications
## 2    1.000 Manual   2.841     1.044    0.296            1
## 1    1.694     DT   4.813     4.661    0.080            1

rm(data, dateCols, nvc, dtt)

data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1, 
    columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), 
    order = "relative"))
##   relative   test elapsed user.self sys.self replications
## 1     1.00     DT   7.223     5.769    0.112            1
## 2    29.27 Manual 211.416     1.560    0.952            1

Ouch! data.table really turns the tables on that last run!
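For completeness, current versions of data.table ship their own fast `melt()` method, which is usually the simplest route for data of this shape today (the values and date columns below mirror the question's sample and are otherwise assumptions):

```r
library(data.table)  # assumes the data.table package is installed

# Small data.table shaped like the question's sample
dtt <- data.table(Name     = c("Abengoa", "Adidas"),
                  Code     = c(4256, 6783),
                  CURRENCY = c("USD", "USD"),
                  `01/01/1980` = c(1.53, 0.23),
                  `02/01/1980` = c(1.54, 0.54))

# melt.data.table treats all non-id columns as measure columns by default
long <- melt(dtt, id.vars = c("Name", "Code", "CURRENCY"),
             variable.name = "Date", value.name = "Price")
```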

