Performing calculations by subsets of data in R


Problem description

I want to perform calculations for each company number in the column PERMNO of my data frame, the summary of which can be seen here:

> summary(companydataRETS)
     PERMNO           RET           
 Min.   :10000   Min.   :-0.971698  
 1st Qu.:32716   1st Qu.:-0.011905  
 Median :61735   Median : 0.000000  
 Mean   :56788   Mean   : 0.000799  
 3rd Qu.:80280   3rd Qu.: 0.010989  
 Max.   :93436   Max.   :19.000000  

My solution so far was to create a variable with all possible company numbers

compns <- companydataRETS[!duplicated(companydataRETS[,"PERMNO"]),"PERMNO"]

And then use a foreach loop with parallel computing that calls my function get.rho(), which in turn performs the desired calculations:

rhos <- foreach (i=1:length(compns), .combine=rbind) %dopar% 
      get.rho(subset(companydataRETS[,"RET"],companydataRETS$PERMNO == compns[i]))

I tested it for a subset of my data and it all works. The problem is that I have 72 million observations, and even after leaving the computer working overnight, it still didn't finish.

I am new to R, so I imagine my code structure can be improved and that there is a better (quicker, less computationally intensive) way to perform this same task, perhaps using apply or with, neither of which I understand. Any suggestions?

Solution

As suggested by Joran, I looked into the data.table package. The modifications to the code are:

library(data.table) 
companydataRETS <- data.table(companydataRETS)
setkey(companydataRETS,PERMNO)

rhos <- foreach (i=1:length(compns), .combine=rbind) %do% 
      get.rho(companydataRETS[J(compns[i])]$RET)
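
To illustrate why the keyed lookup helps, here is a minimal sketch on toy data. It assumes the data.table package is installed. The get.rho() below is a hypothetical stand-in (a lag-1 autocorrelation), since the real function is not shown in this thread, and the toy table and its values are made up for illustration. With a key set, J() finds the matching rows by binary search on the sorted key instead of comparing every row against the company number.

```r
library(data.table)

# Hypothetical stand-in for get.rho(); the poster's real function is not shown.
get.rho <- function(ret) {
  n <- length(ret)
  if (n < 3) return(NA_real_)   # too few observations to correlate
  cor(ret[-1], ret[-n])         # lag-1 autocorrelation of the series
}

# Made-up toy data: 3 companies with 50 returns each.
set.seed(1)
toy <- data.table(
  PERMNO = rep(c(10000L, 10001L, 10002L), each = 50),
  RET    = rnorm(150, sd = 0.02)
)

setkey(toy, PERMNO)   # sort by PERMNO and mark it as the key

# Keyed lookup: binary search on the key column.
rets_keyed <- toy[J(10001L)]$RET

# Equivalent full scan, as in the original subset() approach.
rets_scan <- toy$RET[toy$PERMNO == 10001L]

# Both routes select the same returns in the same order.
stopifnot(identical(rets_keyed, rets_scan))

rho_10001 <- get.rho(rets_keyed)
```

The full scan touches every row once per company, so its cost grows with (number of companies) x (number of rows), while the keyed lookup only pays for the sort once.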

I ran the code once as I originally had it (using subset) and once using data.table, with the variable compns comprising only 30 of the 28659 companies in the dataset. Here are the outputs of system.time() for the two versions:

Using subset:

   user  system elapsed
 43.925  12.413  56.337

Using data.table:

   user  system elapsed
  0.229   0.047   0.276

(For some reason, using %do% instead of %dopar% for the original code made it run faster. The system.time() shown for subset is the one using %do%, the faster of the two in this case.)

I had left the original code running overnight and it hadn't finished after 5 hours, so I gave up and killed it. With this small modification I had my results in less than 5 minutes (I think about 3 mins)!

EDIT

There is an even easier way to do it using data.table, without foreach, which involves replacing the foreach call in the code above with

rhos <- companydataRETS[ , get.rho(RET), by=PERMNO]
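
As a sketch of what the grouped call does, the example below runs it on made-up toy data and checks it against base R's tapply(), which is one of the apply-family functions the question asks about. It assumes the data.table package is installed. As before, get.rho() is a hypothetical stand-in, not the poster's actual function, and the column name rho is chosen here only for readability.

```r
library(data.table)

# Hypothetical stand-in for get.rho(); the poster's real function is not shown.
get.rho <- function(ret) {
  n <- length(ret)
  if (n < 3) return(NA_real_)   # too few observations to correlate
  cor(ret[-1], ret[-n])         # lag-1 autocorrelation of the series
}

# Made-up toy data: 3 companies with 50 returns each.
set.seed(1)
toy <- data.table(
  PERMNO = rep(c(10000L, 10001L, 10002L), each = 50),
  RET    = rnorm(150, sd = 0.02)
)

# One call: data.table splits RET by PERMNO and applies get.rho() per group.
rhos <- toy[, .(rho = get.rho(RET)), by = PERMNO]

# Base R equivalent: tapply() applies get.rho() to RET within each PERMNO.
rhos_base <- tapply(toy$RET, toy$PERMNO, get.rho)

# Same values, one per company.
stopifnot(isTRUE(all.equal(rhos$rho, as.numeric(rhos_base))))
```

The by= form returns a data.table with one row per company, so the result can be joined back to other company-level data without an explicit loop or rbind step.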
