split apply recombine, plyr, data.table in R

Question

I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The apply step runs a regression for each firm and returns the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very long time to run when the number of firms is large. Is there a way to do this with data.table?

Sample data:

dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21

I need to split by each id (namely 1 and 2), run a regression, return the residuals, and append them as a column to my data. Is there a way to do this using data.table?

Recommended answer

I'm guessing the data needs to be sorted by "id" for the residuals to line up properly. Luckily, that happens automatically when you set the key:

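library(data.table)

dat <- read.table(text="dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
", header=TRUE, sep=",", strip.white=TRUE)
dtb <- data.table(dat)
setkey(dtb, "id")   # setting the key also sorts dtb by id
# One regression per id; the result has columns id and V1 (the residuals)
dtb[, residuals(lm(val1 ~ val2)), by="id"]
#---------------
# Bind the per-group residuals back onto the (now sorted) data
cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"])
#---------------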
            dte id val1 val2 id.1            V1
[1,] 2001-10-02  1   10   25    1  1.631688e-15
[2,] 2001-10-03  1   11   24    1 -3.263376e-15
[3,] 2001-10-04  1   12   23    1  1.631688e-15
[4,] 2001-10-02  2   13   22    2  0.000000e+00
[5,] 2001-10-03  2   14   21    2  0.000000e+00



> dat <- data.frame(dte=Sys.Date()+1:1000000, 
                    id=sample(1:2, 1000000, repl=TRUE),  
                    val1=runif(1000000),  val2=runif(1000000) )
> dtb <- data.table(dat)
> setkey(dtb, "id")
> system.time(  cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"]) )
   user  system elapsed 
  1.696   0.798   2.466 
> system.time( dtb[,transform(.SD,r = residuals(lm(val1~val2))),by = "id"] )
   user  system elapsed 
  1.757   0.908   2.690 

EDIT from Matthew: This is all correct for v1.8.0 on CRAN, with the small addition that transform in j is the subject of data.table wiki point 2: "For speed don't transform() by group, cbind() afterwards". But := now works by group in v1.8.1 and is both simple and fast. See my answer for an illustration (but no need to vote for it).
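For readers who want to see what := by group might look like on the sample data from the question, here is a minimal sketch. It assumes data.table v1.8.1 or later is installed; the column name r is only an illustrative choice, and this is not necessarily the exact code from Matthew's answer.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dtb <- data.table(
  dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04",
                   "2001-10-02", "2001-10-03")),
  id   = c(1L, 1L, 1L, 2L, 2L),
  val1 = c(10, 11, 12, 13, 14),
  val2 = c(25, 24, 23, 22, 21)
)

# Fit one regression per id and add its residuals by reference as a
# new column `r` (hypothetical name); no cbind() or key is needed
dtb[, r := residuals(lm(val1 ~ val2)), by = id]

print(dtb)
```

Because := adds the column by reference within each group, the residuals stay attached to their own rows, so the sort-then-cbind step is not required.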

Well, I voted for it. Here is the console command to install v1.8.1 on a Mac (if you have the proper Xcode tools available, since it is only available there as source):

install.packages("data.table", repos= "http://R-Forge.R-project.org", type="source", 
               lib="/Library/Frameworks/R.framework/Versions/2.14/Resources/lib")

(For some reason I could not get the Mac GUI Package Installer to read R-Forge as a repository.)
