split apply recombine, plyr, data.table in R
Question
I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The "apply" step is running a regression for each firm and returning the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very long time to run when the number of firms is large. Is there a way to do this with data.table?
Sample data:
dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
I need to split by each id (namely 1 and 2), run a regression, return the residuals and append them as a column to my data. Is there a way to do this using data.table?
Recommended answer
I'm guessing this needs to be sorted by "id" to line up properly. Luckily that happens automatically when you set the key:
library(data.table)
dat <- read.table(text="dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
", header=TRUE, sep=",")
dtb <- data.table(dat)
setkey(dtb, "id")
dtb[, residuals(lm(val1 ~ val2)), by="id"]
#---------------
cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"])
#---------------
dte id val1 val2 id.1 V1
[1,] 2001-10-02 1 10 25 1 1.631688e-15
[2,] 2001-10-03 1 11 24 1 -3.263376e-15
[3,] 2001-10-04 1 12 23 1 1.631688e-15
[4,] 2001-10-02 2 13 22 2 0.000000e+00
[5,] 2001-10-03 2 14 21 2 0.000000e+00
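Why the key matters for the cbind() approach can be sketched as follows. This is only an illustration on hypothetical toy data (the values and the names res_unsorted, res_sorted, aligned_before and aligned_after are made up for the example): without a key, the grouped result comes back with groups in order of first appearance, so its rows need not line up with the rows of the original table.

```r
library(data.table)

# Hypothetical toy data with the ids deliberately interleaved
dtb <- data.table(id   = c(2, 1, 2, 1, 2, 1),
                  val1 = c(1, 5, 2, 7, 4, 6),
                  val2 = c(3, 1, 4, 2, 6, 5))

# No key yet: the grouped result is ordered by group, the table is not
res_unsorted <- dtb[, residuals(lm(val1 ~ val2)), by = "id"]
aligned_before <- identical(res_unsorted$id, dtb$id)

# setkey() sorts the table by id, so the two row orders now agree
setkey(dtb, "id")
res_sorted <- dtb[, residuals(lm(val1 ~ val2)), by = "id"]
aligned_after <- identical(res_sorted$id, dtb$id)

print(c(before = aligned_before, after = aligned_after))
```

Only in the second case is it safe to cbind() the residuals onto the table.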
> dat <- data.frame(dte=Sys.Date()+1:1000000,
id=sample(1:2, 1000000, replace=TRUE),
val1=runif(1000000), val2=runif(1000000) )
> dtb <- data.table(dat)
> setkey(dtb, "id")
> system.time( cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"]) )
user system elapsed
1.696 0.798 2.466
> system.time( dtb[,transform(.SD,r = residuals(lm(val1~val2))),by = "id"] )
user system elapsed
1.757 0.908 2.690
EDIT from Matthew:
This is all correct for v1.8.0 on CRAN, with the small addition that transform in j is the subject of data.table wiki point 2: "For speed don't transform() by group, cbind() afterwards". But := now works by group in v1.8.1 and is both simple and fast. See my answer for illustration (but no need to vote for it).
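For readers who do not want to look up the other answer, a minimal sketch of := by group on the sample data from the question might look like this. It is not Matthew's own code; the column name r is just a label chosen for the example, and it assumes a data.table version in which := works with by (v1.8.1 or later, per the edit above).

```r
library(data.table)

# The sample data from the question
dtb <- data.table(
  dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04",
                   "2001-10-02", "2001-10-03")),
  id   = c(1, 1, 1, 2, 2),
  val1 = c(10, 11, 12, 13, 14),
  val2 = c(25, 24, 23, 22, 21)
)

# One regression per id; the residuals are added as column r by reference,
# so no cbind() and no copy of the table are needed
dtb[, r := residuals(lm(val1 ~ val2)), by = id]

print(dtb)
```

Because := assigns within each group in place, the residuals end up on the right rows whether or not a key is set.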
Well, I voted for it. Here is the console command to install v1.8.1 on a Mac (if you have the proper Xcode tools available, since it is only there in source):
install.packages("data.table", repos= "http://R-Forge.R-project.org", type="source",
lib="/Library/Frameworks/R.framework/Versions/2.14/Resources/lib")
(For some reason I could not get the Mac GUI Package Installer to read r-forge as a repository.)