Reduce computing time for reshape
Question
I have the following dataset, which I would like to reshape from wide to long format:
Name     Code  CURRENCY  01/01/1980  02/01/1980  03/01/1980  04/01/1980
Abengoa  4256  USD       1.53        1.54        1.51        1.52
Adidas   6783  USD       0.23        0.54        0.61        0.62
The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:
library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")
However, for .csv files that are about 50MB big, it already takes about two hours. The computing time shouldn't be driven by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16GB of RAM. Is there any other more efficient way to do this?
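For a sense of what that call produces, here is the question's melt() applied to a two-date toy slice of the data (the X-prefixed column names are an assumption: read.csv converts headers like 01/01/1980 into syntactic names such as X01.01.1980):

```r
library(reshape)

## Toy two-firm, two-date version of the wide data
data <- data.frame(Name = c("Abengoa", "Adidas"),
                   Code = c(4256, 6783),
                   CURRENCY = c("USD", "USD"),
                   X01.01.1980 = c(1.53, 0.23),
                   X02.01.1980 = c(1.54, 0.54))

## The melt() call from the question: one row per firm per date,
## with the date headers gathered into a "Date" column
data1 <- melt(data, id = c("Name", "Code", "CURRENCY"),
              variable_name = "Date")
head(data1)
```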
Many thanks!
Answer
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
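As a sketch on the same kind of toy data (the column names are assumptions standing in for the real date headers), the stack() approach looks like this; stack() unstacks the value columns column by column, and data.frame() recycles the three id columns down the stacked rows:

```r
## Toy wide data (assumed column names; read.csv prefixes dates with "X")
data <- data.frame(Name = c("Abengoa", "Adidas"),
                   Code = c(4256, 6783),
                   CURRENCY = c("USD", "USD"),
                   X01.01.1980 = c(1.53, 0.23),
                   X02.01.1980 = c(1.54, 0.54))

## stack() the value columns, then cbind the id columns back on;
## the result has a "values" column and an "ind" (source column) column
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
head(data1)
```

Because stack() emits values in column-major order and data.frame() recycles data[1:3] in whole-row blocks, each stacked value lines up with the firm it came from.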
For that matter, this might also be worth considering:
data.frame(data[1:3],
           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious to benchmark on such a small sample of data though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
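On the same toy data (again with assumed column names), the "manual" construction works because as.vector(as.matrix(...)) flattens the value columns in column-major order, which matches both the recycling of the id columns and the rep(..., each = nrow(data)) of the date labels:

```r
## Toy wide data (assumed column names)
data <- data.frame(Name = c("Abengoa", "Adidas"),
                   Code = c(4256, 6783),
                   CURRENCY = c("USD", "USD"),
                   X01.01.1980 = c(1.53, 0.23),
                   X02.01.1980 = c(1.54, 0.54))

## Flatten the value columns into one vector (column-major),
## and repeat each date header once per row of the original data
data1 <- data.frame(data[1:3],
                    vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
                    date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
data1
```

Note that this only works cleanly when all the value columns share one type (here, numeric), since as.matrix() would otherwise coerce everything to character.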
Using @RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation. You can see the results of the benchmarks here, on datasets ranging from 1000 rows to 3000 rows, in 500 row increments, and all with 8003 columns (8000 data columns, plus the three initial columns).
The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo's correct--there seems to be something about 3000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has any explanation about what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
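The DT expression being benchmarked is defined in the linked script rather than shown here. As a rough stand-in (an assumption, not the original code), data.table's own melt method, which arrived in later versions of the package, performs the same reshape:

```r
library(data.table)

## Toy wide data as before (assumed column names)
dtt <- data.table(Name = c("Abengoa", "Adidas"),
                  Code = c(4256, 6783),
                  CURRENCY = c("USD", "USD"),
                  X01.01.1980 = c(1.53, 0.23),
                  X02.01.1980 = c(1.54, 0.54))

## data.table's melt: same id/value split as reshape::melt, but note the
## dotted argument names (variable.name, value.name)
long <- melt(dtt, id.vars = c("Name", "Code", "CURRENCY"),
             variable.name = "Date", value.name = "Price")
long
```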
Here are the results for just the last three runs:
library(data.table)  # for data.table()
library(rbenchmark)  # benchmark() comes from the "rbenchmark" package
## makeSomeData(), and the DT and Manual expressions eval()ed below,
## are defined in the linked benchmarking script.
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!