Performing calculations by subsets of data in R
I want to perform calculations for each company number in the column PERMNO of my data frame, the summary of which can be seen here:
> summary(companydataRETS)
PERMNO RET
Min. :10000 Min. :-0.971698
1st Qu.:32716 1st Qu.:-0.011905
Median :61735 Median : 0.000000
Mean :56788 Mean : 0.000799
3rd Qu.:80280 3rd Qu.: 0.010989
Max. :93436 Max. :19.000000
My solution so far was to create a variable with all possible company numbers
compns <- companydataRETS[!duplicated(companydataRETS[,"PERMNO"]),"PERMNO"]
And then use a foreach loop with parallel computing which calls my function get.rho(), which in turn performs the desired calculations:
rhos <- foreach (i=1:length(compns), .combine=rbind) %dopar%
get.rho(subset(companydataRETS[,"RET"],companydataRETS$PERMNO == compns[i]))
I tested it for a subset of my data and it all works. The problem is that I have 72 million observations, and even after leaving the computer working overnight, it still didn't finish.
I am new to R, so I imagine my code structure can be improved upon and that there is a better (quicker, less computationally intensive) way to perform this same task (perhaps using apply or with, neither of which I understand). Any suggestions?
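For reference, the apply-style alternative alluded to above could be sketched as follows. This is only a sketch: since get.rho() isn't shown in the question, a stand-in that computes the group mean is used here, and the data frame is a tiny made-up sample.

```r
# Stand-in for the real get.rho(), which isn't shown in the question;
# it just computes the mean so the example runs end to end.
get.rho <- function(ret) mean(ret)

# Tiny made-up sample in the same shape as companydataRETS
companydataRETS <- data.frame(
  PERMNO = c(10000, 10000, 32716, 32716, 61735),
  RET    = c(0.01, -0.02, 0.03, 0.00, -0.01)
)

# split() partitions RET into one vector per PERMNO;
# sapply() then applies get.rho to each group.
rhos <- sapply(split(companydataRETS$RET, companydataRETS$PERMNO), get.rho)
# rhos is a named vector: 10000 -> -0.005, 32716 -> 0.015, 61735 -> -0.01
```

This avoids scanning the whole data frame once per company, which is what makes the subset() loop slow.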
As suggested by Joran, I looked into the data.table library. The modifications to the code are:
library(data.table)
companydataRETS <- data.table(companydataRETS)
setkey(companydataRETS,PERMNO)
rhos <- foreach (i=1:length(compns), .combine=rbind) %do%
get.rho(companydataRETS[J(compns[i])]$RET)
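A minimal, self-contained version of this keyed-lookup pattern is sketched below. Assumptions: get.rho() is a stand-in (the real one isn't shown), the data are made up, and sapply() replaces foreach to keep the sketch dependency-light; the data.table mechanics are the same.

```r
library(data.table)

# Stand-in for the real get.rho()
get.rho <- function(ret) mean(ret)

# Made-up sample data
companydataRETS <- data.table(
  PERMNO = c(10000, 10000, 32716, 32716),
  RET    = c(0.01, -0.02, 0.03, 0.00)
)

# setkey() sorts the table by PERMNO so subsequent lookups
# use a binary search instead of a full vector scan
setkey(companydataRETS, PERMNO)

compns <- unique(companydataRETS$PERMNO)

# J(p) builds a one-row join table; companydataRETS[J(p)] is a
# fast keyed join returning only the rows for that company
rhos <- sapply(compns, function(p) get.rho(companydataRETS[J(p)]$RET))
# rhos: -0.005 (PERMNO 10000), 0.015 (PERMNO 32716)
```

The speedup in the timings below comes from this keyed binary search: subset() rescans all rows for every company, while the keyed join touches only the matching rows.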
I ran the code once as I originally had it (using subset) and once using data.table, with the variable compns comprising only 30 of the 28,659 companies in the dataset. Here are the outputs of system.time() for the two versions:
Using subset:

   user  system elapsed
 43.925  12.413  56.337
Using data.table:

   user  system elapsed
  0.229   0.047   0.276
(For some reason, using %do% instead of %dopar% made the original code run faster. The system.time() for subset is the one using %do%, the faster of the two in this case.)
I had left the original code running overnight and it hadn't finished after 5 hours, so I gave up and killed it. With this small modification I had my results in less than 5 minutes (I think about 3 mins)!
EDIT
There is an even easier way to do it using data.table, without foreach, which involves substituting the last line of the code above with:

rhos <- companydataRETS[, get.rho(RET), by = PERMNO]
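Spelled out as a runnable sketch (again with a stand-in get.rho() and made-up data, since the real function and dataset aren't shown):

```r
library(data.table)

# Stand-in for the real get.rho()
get.rho <- function(ret) mean(ret)

# Made-up sample data
companydataRETS <- data.table(
  PERMNO = c(10000, 10000, 32716),
  RET    = c(0.01, -0.02, 0.03)
)

# by=PERMNO makes data.table call get.rho() once per company,
# passing it that group's RET column; no explicit loop needed.
# The result is a data.table with columns PERMNO and V1.
rhos <- companydataRETS[, get.rho(RET), by = PERMNO]
```

Grouped calls like this also let data.table optimize memory allocation across groups, so it is typically faster than looping over keyed subsets one company at a time.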