Linear model and dplyr - a better solution?

Views: 74. Posted: 2017/7/13 20:21:44. Tags: r, dplyr


Problem description

 
I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() and trying to find a slope from this transformed data and thought I'd open up a new question.

First I have data that looks like this:
Var1    Var2    Var3    Time           Temp
a       w       j       9/9/2014       20
a       w       j       9/9/2014       15
a       w       k       9/20/2014       10
a       w       j       9/10/2014       0
b       x       L       9/12/2014       30
b       x       L       9/12/2014       10
b       y       k       9/13/2014       20
b       y       k       9/13/2014       15
c       z       j       9/14/2014       20
c       z       j       9/14/2014       10
c       z       k       9/14/2014       11
c       w       l       9/10/2014       45
a       d       j       9/22/2014       20
a       d       k       9/15/2014       4
a       d       l       9/15/2014       23
a       d       k       9/15/2014       11
And I want it in the form of this (values for Slope and Pearson simulated for illustration):
V1  V2  V3  Slope   Pearson
a   w   j   -3      -0.9
a   w   k   2       0
a   d   j   1.5     0.6
a   d   k   0       0.5
a   d   l   -0.5    -0.6
b   x   L   12      0.7
b   y   k   4       0.6
c   z   j   -1      -0.5
c   z   k   -3      -0.4
c   w   l   -10     -0.9
The slope being a linear-least-squares slope. In theory, the script would look like so:
library(dplyr)

data <- read.table("clipboard",sep="\t",quote="",header=T)

newdata = summarise(group_by(data
                              ,Var1
                              ,Var2
                              ,Var3                            
                              )
                     ,Slope = lm(Temp ~ Time)$coeff[2]                 
                     ,Pearson = cor(Time, Temp, method="pearson")
                     )
But R throws an error like it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but returns the slope for the entire data set and not the subsetted form that I'm looking for. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to have it run in a similar manner or use a different function entirely to get a slope passed from the subset?
Solution
You have several issues here. 

If you group your data by three variables (or even two), most groups don't have enough distinct values to fit a linear regression model.
Pearson correlation requires two numeric variables, but Time is a factor, and converting a factor to numeric won't make much sense.
The third issue is that you need to use do in order to run your linear model within each group.
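To make the first issue concrete, one can count how many rows and how many distinct Time values each (Var1, Var2, Var3) group contains. The following is a minimal base R sketch, not part of the original answer; the object names `dat`, `key`, `group_sizes` and `estimable` are assumptions for illustration, and the question's sample data is retyped inline.

```r
# Question's sample data, retyped inline (the name `dat` is an assumption)
dat <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE)

# One key per (Var1, Var2, Var3) combination that actually occurs
key <- interaction(dat$Var1, dat$Var2, dat$Var3, drop = TRUE)

# Rows and distinct Time values per group
group_sizes <- data.frame(
  n_rows  = as.vector(tapply(dat$Time, key, length)),
  n_times = as.vector(tapply(dat$Time, key, function(x) length(unique(x)))),
  row.names = levels(key)
)
print(group_sizes)

# A slope needs at least two distinct Time values within a group
estimable <- rownames(group_sizes)[group_sizes$n_times >= 2]
print(estimable)
```

With this sample, only one of the ten three-variable groups has two distinct Time values, which is why the illustration groups on Var1 alone.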
Here's an illustration that groups only on Var1:
data %>%
  group_by(Var1) %>% # You can add here additional grouping variables if your real data set enables it
  do(mod = lm(Temp ~ Time, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) %>%
  select(-mod)
# Source: local data frame [3 x 2]
# Groups: <by row>
#   
#   Var1     Slope
# 1    a  12.66667
# 2    b  -2.50000
# 3    c -31.33333 
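Because Time is a factor in the model above, coeff[2] is the offset of one Time level from the reference level, not a rate of change per day. If a per-day slope is what is wanted, one option is to parse Time as a date first. The sketch below uses base R only and is an assumption-laden illustration rather than the answerer's method: it assumes the dates are month/day/year, and the names `dat`, `Day` and `slopes` are hypothetical.

```r
# Question's sample data, retyped inline (the name `dat` is an assumption)
dat <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE)

# Parse Time as a date (month/day/year assumed) and express it in days
dat$Day <- as.numeric(as.Date(dat$Time, format = "%m/%d/%Y"))

# Fit Temp ~ Day within each Var1 group and keep only the slope
slopes <- sapply(split(dat, dat$Var1), function(d) {
  coef(lm(Temp ~ Day, data = d))[["Day"]]
})
print(slopes)
```

The result is a named numeric vector with one slope per Var1 level, in degrees per day.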




If you do have two numeric variables, you can use do to calculate the correlation too. For example (I will create some dummy numeric variables for illustration):
data %>%
  mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
         test2 = sample(1:3, n(), replace = TRUE)) %>%
  group_by(Var1) %>%
  do(mod = lm(Temp ~ Time, data = .),
     mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
  mutate(Slope = summary(mod)$coeff[2],
         Pearson = mod2[1]) %>%
  select(-mod, -mod2)


# Source: local data frame [3 x 3]
# Groups: <by row>
#   
#   Var1     Slope     Pearson
# 1    a  12.66667  0.25264558
# 2    b  -2.50000 -0.09090909
# 3    c -31.33333  0.30151134
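In more recent dplyr releases (assumed here: 1.0 or later), a model call can usually be placed directly inside summarise(), because the formula is evaluated against the columns of the current group. The sketch below relies on that assumption and is not taken from the original answer. It also derives a numeric `Day` column from Time (a hypothetical name, assuming month/day/year dates) so that the Pearson correlation with Temp is meaningful.

```r
library(dplyr)

# Question's sample data, retyped inline (the name `dat` is an assumption)
dat <- read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE)

result <- dat %>%
  # Numeric day count derived from Time (month/day/year assumed)
  mutate(Day = as.numeric(as.Date(Time, format = "%m/%d/%Y"))) %>%
  group_by(Var1) %>%
  summarise(
    Slope   = coef(lm(Temp ~ Day))[["Day"]],       # least-squares slope per group
    Pearson = cor(Day, Temp, method = "pearson")   # correlation per group
  )
print(result)
```

Groups with fewer than two distinct Day values would give an NA slope or correlation, so the grouping caveat from the answer still applies.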




Bonus solution: you can do this quite efficiently and easily with the data.table package too:
library(data.table)
setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
#    Var1     Slope
# 1:    a  12.66667
# 2:    b  -2.50000
# 3:    c -31.33333
Or, if we want to create some dummy variables too:
library(data.table)
setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE), 
                   test2 = sample(1:3, .N, replace = TRUE))][, 
                   list(Slope = summary(lm(Temp ~ Time))$coeff[2],
                        Pearson = cor(test1, test2, method = "pearson")), Var1]
#    Var1     Slope     Pearson
# 1:    a  12.66667 -0.02159168
# 2:    b  -2.50000 -0.81649658
# 3:    c -31.33333 -1.00000000
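The data.table approach can also be combined with a numeric time variable so that slope and correlation both refer to Temp against elapsed days. This is a sketch under stated assumptions, not the answerer's code: `dt`, `Day` and `res` are hypothetical names, dates are assumed to be month/day/year, and coef() is used in place of summary(...)$coeff[2] to pull the slope by name.

```r
library(data.table)

# Question's sample data, retyped inline (the name `dt` is an assumption)
dt <- as.data.table(read.table(text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
", header = TRUE, stringsAsFactors = FALSE))

# Add a numeric day count by reference (month/day/year assumed)
dt[, Day := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# Slope and Pearson correlation of Temp against Day, per Var1 group
res <- dt[, .(Slope   = coef(lm(Temp ~ Day))[["Day"]],
              Pearson = cor(Day, Temp, method = "pearson")),
          by = Var1]
print(res)
```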


                        