Using gather from tidyr changes my regression results
Problem description
When I run the code below, everything works as expected:
# install.packages("dynlm")
# install.packages("tidyr")
library(dynlm)
library(tidyr)
Time <- 1950:1993
Y <- c(5820, 5843, 5917, 6054, 6099, 6365, 6440, 6465, 6449, 6658, 6698, 6740, 6931,
7089, 7384, 7703, 8005, 8163, 8506, 8737, 8842, 9022, 9425, 9752, 9602, 9711,
10121, 10425, 10744, 10876, 10746, 10770, 10782, 11179, 11617, 12015, 12336,
12568, 12903, 13029, 13093, 12899, 13110, 13391)
X <- c(6284, 6390, 6476, 6640, 6628, 6879, 7080, 7114, 7113, 7256, 7264, 7382, 7583, 7718,
8140, 8508, 8822, 9114, 9399, 9606, 9875, 10111, 10414, 11013, 10832, 10906, 11192,
11406, 11851, 12039, 12005, 12156, 12146, 12349, 13029, 13258, 13552, 13545, 13890,
14005, 14101, 14003, 14279, 14341)
data <- data.frame(Time, Y, X)
data_ts <- ts(data, start = 1950, end = 1993, frequency = 1)
Modell <- dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)) + log(L(X, 3))
+ log(L(X, 4)) + log(L(X, 5)), data = data_ts)
summary(Modell)
In this case, my summary output is:
...
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.059109   0.091926  -0.643    0.525
log(X)        0.883020   0.145754   6.058 9.17e-07 ***
log(L(X))     0.004167   0.211420   0.020    0.984
log(L(X, 2)) -0.092880   0.207026  -0.449    0.657
log(L(X, 3)) -0.012016   0.210395  -0.057    0.955
log(L(X, 4))  0.200596   0.212370   0.945    0.352
log(L(X, 5))  0.014497   0.144103   0.101    0.920
...
Now, when I use gather() to define a new data frame for some plots,
data_tidyr <- gather(data, "Key", "Value", -Time)
and re-run the code above without changing anything else, I get this summary:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.05669    0.07546  -0.751    0.457
log(X)        0.82128    0.13486   6.090 3.53e-07 ***
log(L(X))     0.17484    0.13365   1.308    0.198
log(L(X, 2))       NA         NA      NA       NA
log(L(X, 3))       NA         NA      NA       NA
log(L(X, 4))       NA         NA      NA       NA
log(L(X, 5))       NA         NA      NA       NA
I am puzzled by this behaviour, because the gather operation (defining a new data frame with columns gathered into rows) has nothing to do with the data set I am using to run my regression (at least that was my impression). Somehow, using gather() changes the way the calculation is done, but I cannot see how. Help would be much appreciated!
Some version details:
- dynlm version: 0.3-3
- R version: 3.2.0 (64-bit)
OK, thank you for all the answers and comments so far, but the question remains: WHAT is going on in the environment? I want to know why and how this happens. To me this is serious, since to my understanding avoiding unintended side effects of one function call on others is precisely what functional languages like R are trying to achieve. Unless I am missing something here, this behaviour seems to be at odds with that intention.
Recommended answer
The underlying reason for this unexpected change is that dplyr (dplyr, not tidyr) changes the default method of the lag() function. The gather() function calls dplyr::select_vars, which loads dplyr via its namespace and overrides lag.default.
When you use L() in the formula, the dynlm function calls lag() internally. Method dispatch then resolves to lag.default. Once dplyr has been loaded via its namespace (it does not even need to be attached), the lag.default from dplyr is the one that is found.
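To see which method a lookup would currently return, one can ask R directly. The following is a minimal sketch that uses only base R and utils; the variable names are illustrative, and whether dplyr ever shows up as the owner depends on the installed dplyr version (the behaviour described here is the one reported for the versions in the question).

```r
# Is dplyr's namespace loaded, even though library(dplyr) was never called?
dplyr_loaded <- "dplyr" %in% loadedNamespaces()

# Look up the default method for lag(), then report which namespace
# the function that was found belongs to.
current_method <- utils::getS3method("lag", "default")
owner <- environmentName(environment(current_method))

dplyr_loaded
owner  # "stats" in a fresh session with only base packages loaded
```

Running this once before and once after the gather() call is a cheap way to check whether the lookup result changed in between.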
The two lag functions are fundamentally different. In a new R session, you will find the following difference:
lag(1:3, 1)
## [1] 1 2 3
## attr(,"tsp")
## [1] 0 2 1
invisible(dplyr::mutate) # side effect: loads dplyr via namespace...
lag(1:3, 1)
## [1] NA 1 2
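The difference can be reproduced without loading dplyr at all. In the sketch below, stats:::lag.default is called directly, and shift_values() is a hypothetical helper written only for this illustration that mimics the value-shifting output shown above; it is not dplyr's implementation.

```r
# stats semantics: the values stay put, only the time index (tsp) moves.
x <- ts(c(10, 20, 30), start = 2001)
by_index <- stats:::lag.default(x, -1)
as.numeric(by_index)  # 10 20 30
tsp(by_index)         # 2002 2004 1

# Value-shifting semantics (hypothetical helper, NOT dplyr's code):
# the positions stay put, the values move down and are padded with NA.
shift_values <- function(x, n = 1L) {
  stopifnot(n >= 0, n <= length(x))
  c(rep(NA, n), x[seq_len(length(x) - n)])
}
shift_values(1:3, 1L)  # NA 1 2
```

A function that expects the first behaviour, as dynlm does when it aligns series by their time index, will build different regressors when it is handed the second.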
So the solution is fairly simple: overwrite the lag.default function yourself.
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))
##     -0.05476       0.83870       0.01818       0.13928
lag.default <- dplyr:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1951, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))
##     -0.05669       0.82128       0.17484            NA
lag.default <- stats:::lag.default
dynlm(log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
## Time series regression with "ts" data:
## Start = 1952, End = 1993
##
## Call:
## dynlm(formula = log(Y) ~ log(X) + log(L(X)) + log(L(X, 2)), data = data_ts)
##
## Coefficients:
##  (Intercept)        log(X)     log(L(X))  log(L(X, 2))
##     -0.05476       0.83870       0.01818       0.13928
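As a cross-check that does not depend on lag() dispatch at all, the two-lag regression can be rebuilt by hand with base R. This is only a sketch: it assumes L(X, k) means "X shifted back by k years", uses embed() to line the lags up, and fits with plain lm(). The column names are made up for the illustration.

```r
Y <- c(5820, 5843, 5917, 6054, 6099, 6365, 6440, 6465, 6449, 6658, 6698, 6740, 6931,
       7089, 7384, 7703, 8005, 8163, 8506, 8737, 8842, 9022, 9425, 9752, 9602, 9711,
       10121, 10425, 10744, 10876, 10746, 10770, 10782, 11179, 11617, 12015, 12336,
       12568, 12903, 13029, 13093, 12899, 13110, 13391)
X <- c(6284, 6390, 6476, 6640, 6628, 6879, 7080, 7114, 7113, 7256, 7264, 7382, 7583, 7718,
       8140, 8508, 8822, 9114, 9399, 9606, 9875, 10111, 10414, 11013, 10832, 10906, 11192,
       11406, 11851, 12039, 12005, 12156, 12146, 12349, 13029, 13258, 13552, 13545, 13890,
       14005, 14101, 14003, 14279, 14341)

k <- 2  # highest lag in the model
# Row i of embed() holds log(X) at time t, t - 1, ..., t - k,
# starting with the first t for which all lags exist (1952 here).
lags <- embed(log(X), k + 1)
colnames(lags) <- c("lx", "lx_lag1", "lx_lag2")
y <- log(Y)[(k + 1):length(Y)]  # drop the first k years to match

fit <- lm(y ~ lags)
coef(fit)
```

If the assumption about L() holds, these coefficients should agree with the stats-based output above (Start = 1952, 42 observations) and contain no NA. A mismatch would suggest that a different lag method is in effect.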