Linear model and dplyr - a better solution?
Problem description
I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() and trying to find a slope from this transformed data and thought I'd open up a new question.
First I have data that looks like this:
    Var1  Var2  Var3  Time       Temp
    a     w     j     9/9/2014   20
    a     w     j     9/9/2014   15
    a     w     k     9/20/2014  10
    a     w     j     9/10/2014  0
    b     x     L     9/12/2014  30
    b     x     L     9/12/2014  10
    b     y     k     9/13/2014  20
    b     y     k     9/13/2014  15
    c     z     j     9/14/2014  20
    c     z     j     9/14/2014  10
    c     z     k     9/14/2014  11
    c     w     l     9/10/2014  45
    a     d     j     9/22/2014  20
    a     d     k     9/15/2014  4
    a     d     l     9/15/2014  23
    a     d     k     9/15/2014  11
And I want it in the form of this (values for Slope and Pearson simulated for illustration):
    V1  V2  V3  Slope  Pearson
    a   w   j   -3     -0.9
    a   w   k   2      0
    a   d   j   1.5    0.6
    a   d   k   0      0.5
    a   d   l   -0.5   -0.6
    b   x   L   12     0.7
    b   y   k   4      0.6
    c   z   j   -1     -0.5
    c   z   k   -3     -0.4
    c   w   l   -10    -0.9
The slope being a linear-least-squares slope. In theory, the script would look like so:
    library(dplyr)
    data <- read.table("clipboard", sep="\t", quote="", header=T)
    newdata = summarise(group_by(data
                                ,Var1
                                ,Var2
                                ,Var3
                                )
                        ,Slope = lm(Temp ~ Time)$coeff[2]
                        ,Pearson = cor(Time, Temp, method="pearson")
                        )
But R throws an error like it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but that returns the slope for the entire data set and not for the subsets I'm looking for. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to have it run in a similar manner, or should I use a different function entirely to get a slope from each subset?

Solution

You have several issues here.
- If you group your data by 3 variables (or even 2), you don't have enough distinct values in each group to run a linear regression model
- Pearson requires two numeric variables, while Time is a factor, and converting it to numeric won't make much sense
- The third issue is that you will need to use do in order to run your linear model

Here's an illustration for grouping only on Var1:
    data %>%
      group_by(Var1) %>% # You can add here additional grouping variables if your real data set enables it
      do(mod = lm(Temp ~ Time, data = .)) %>%
      mutate(Slope = summary(mod)$coeff[2]) %>%
      select(-mod)
    # Source: local data frame [3 x 2]
    # Groups: <by row>
    #
    #   Var1     Slope
    # 1    a  12.66667
    # 2    b  -2.50000
    # 3    c -31.33333
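Because Time is a factor here, the second coefficient of lm(Temp ~ Time) is the offset of one date level from the baseline level, not a change per unit of time. The base R sketch below rebuilds the sample data, reproduces the factor-based coefficients per Var1, and then fits against a numeric day count. It assumes the dates are month/day/year, and `Days`, `factor_coef` and `day_slope` are names made up for the illustration.

```r
# Sample data from the question, read from text instead of the clipboard
data <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = TRUE)

# Time as a factor: coefficient 2 is the difference between the mean Temp
# of the second date level and that of the baseline level within each group
factor_coef <- sapply(split(data, data$Var1),
                      function(d) coef(lm(Temp ~ Time, data = d))[[2]])

# Assumption: dates are month/day/year. Days (hypothetical column) is a day count.
data$Days <- as.numeric(as.Date(as.character(data$Time), format = "%m/%d/%Y"))

# Time as a number: coefficient 2 is a least-squares slope in Temp per day
day_slope <- sapply(split(data, data$Var1),
                    function(d) coef(lm(Temp ~ Days, data = d))[[2]])

print(round(factor_coef, 5))
print(round(day_slope, 5))
```

For group b the two numbers coincide (-2.5) because its two dates are exactly one day apart. For group c the dates are four days apart, so the per-day slope is a quarter of the factor coefficient.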
If you do have two numeric variables, you can use do in order to calculate correlation too, for example (I will create some dummy numeric variables for illustration)

    data %>%
      mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
             test2 = sample(1:3, n(), replace = TRUE)) %>%
      group_by(Var1) %>%
      do(mod = lm(Temp ~ Time, data = .),
         mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
      mutate(Slope = summary(mod)$coeff[2],
             Pearson = mod2[1]) %>%
      select(-mod, -mod2)
    # Source: local data frame [3 x 3]
    # Groups: <by row>
    #
    #   Var1     Slope     Pearson
    # 1    a  12.66667  0.25264558
    # 2    b  -2.50000 -0.09090909
    # 3    c -31.33333  0.30151134
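cor() worked inside summarise in the question because it takes plain vectors. For a single predictor, the least-squares slope can be written the same way, as cov(x, y) / var(x). The sketch below is one possible way to use that, not the answer's method. It assumes dplyr 1.0 or later (for the `.groups` argument) and month/day/year dates, and `Days`, `by_var1` and `by_all` are hypothetical names. The second summary groups by all three variables to show how few groups contain two distinct dates.

```r
library(dplyr)

data <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE)

# Assumption: dates are month/day/year. Days (hypothetical column) is a day count.
data <- mutate(data, Days = as.numeric(as.Date(Time, format = "%m/%d/%Y")))

# Slope and Pearson correlation of Temp against Days, one row per Var1
by_var1 <- data %>%
  group_by(Var1) %>%
  summarise(Slope   = cov(Days, Temp) / var(Days),
            Pearson = cor(Days, Temp))

# Grouping by all three variables: a slope needs at least two distinct dates,
# so groups that lack them get NA instead of a 0/0 result
by_all <- data %>%
  group_by(Var1, Var2, Var3) %>%
  summarise(n      = n(),
            n_days = n_distinct(Days),
            Slope  = if (n_distinct(Days) >= 2) cov(Days, Temp) / var(Days) else NA_real_,
            .groups = "drop")

print(by_var1)
print(by_all)
```

With this sample, only the a/w/j group has two distinct dates, so it is the only one of the ten three-variable groups that gets a slope.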
Bonus solution: you can do this quite efficiently/easily with the data.table package too

    library(data.table)
    setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
    #    Var1     Slope
    # 1:    a  12.66667
    # 2:    b  -2.50000
    # 3:    c -31.33333
Or if we want to create some dummy variables too
    library(data.table)
    setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE),
                       test2 = sample(1:3, .N, replace = TRUE))][,
                  list(Slope = summary(lm(Temp ~ Time))$coeff[2],
                       Pearson = cor(test1, test2, method = "pearson")), Var1]
    #    Var1     Slope     Pearson
    # 1:    a  12.66667 -0.02159168
    # 2:    b  -2.50000 -0.81649658
    # 3:    c -31.33333 -1.00000000
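The dummy columns above only demonstrate the syntax. If Time is meant to be a date, a data.table version with a numeric day count could look like the sketch below. It assumes month/day/year dates, and `dt`, `Days` and `res` are names chosen for the illustration.

```r
library(data.table)

dt <- as.data.table(read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE))

# Assumption: dates are month/day/year. Days (hypothetical column) is a day count.
dt[, Days := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# Per-group least-squares slope (Temp per day) and Pearson correlation
res <- dt[, .(Slope   = coef(lm(Temp ~ Days))[[2]],
              Pearson = cor(Days, Temp)),
          by = Var1]

print(res)
```

Here Pearson is computed between Days and Temp, so it describes the same relationship as the slope and always has the same sign.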