使用空间数据从ddply函数内部的多元线性回归(lm)中提取p值 [英] extracting p values from multiple linear regression (lm) inside of a ddply function using spatial data

查看:128
本文介绍了使用空间数据从ddply函数内部的多元线性回归(lm)中提取p值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组空间坐标(x,y)数据,这些数据在几年的过程中每个坐标都有一个响应变量.以下代码生成类似的数据帧:

df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30),10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

结果数据框:

head(df)

 id  x   y year   response
1  1 25 100 1980  0.1707431
2  2 30 200 1981  1.3562263
3  1 25 100 1982 -0.4590506
4  2 30 200 1983  1.3238410
5  1 25 100 1984  1.7765772
6  2 30 200 1985 -0.6258069

我想通过时间对每个单元格进行线性回归以获得响应变量的变化.为此,我使用plyr和ddply命令:

require(plyr)    
lm.df <- ddply(df, .(id), function(z)coef(lm(response ~ year, data = z)))

然后我重新组合空间数据(此处的组合示例有点简单,但适用于该示例.):

points <- data.frame(
  id = c(1,2),
  x = c(25,30),
  y = c(100,200)
)

lm.stack <-  merge(points, lm.df, by="id")
colnames(lm.stack) <- c("ID", "x", "y", "intercept", "slope")

print(lm.stack)
ID  x   y intercept       slope
1  1 25 100  257.7291 -0.12985632
2  2 30 200  173.3676 -0.08708068

这很好用.但是我想要的是能够提取每个lm模型的调整平方值,从而能够通过协调单元格与投影以显着性值着色的单元格图的最终目标来表示重要的趋势.这是我需要帮助的地方,非常感谢.谢谢!

解决方案

我建议结合使用软件包dplyrbroom. 这种方法会将您需要了解的有关模型的所有信息(有关系数的信息和有关模型本身的信息)作为数据框返回:

set.seed(25)

df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30),10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

df

#    id  x   y year    response
# 1   1 25 100 1980 -0.21183360
# 2   2 30 200 1981 -1.04159113
# 3   1 25 100 1982 -1.15330756
# 4   2 30 200 1983  0.32153150
# 5   1 25 100 1984 -1.50012988
# 6   2 30 200 1985 -0.44553326
# 7   1 25 100 1986  1.73404543
# 8   2 30 200 1987  0.51129562
# 9   1 25 100 1988  0.09964504
# 10  2 30 200 1989 -0.05789111
# 11  1 25 100 1980 -1.74278763
# 12  2 30 200 1981 -1.32495298
# 13  1 25 100 1982 -0.54793388
# 14  2 30 200 1983 -1.45638428
# 15  1 25 100 1984  0.08268682
# 16  2 30 200 1985  0.92757895
# 17  1 25 100 1986 -0.71676933
# 18  2 30 200 1987  0.96239968
# 19  1 25 100 1988  1.54588458
# 20  2 30 200 1989 -1.00976361



library(dplyr)
library(broom)


df %>% 
  group_by(id) %>%
  do({model = lm(response~year, data=.)    # create your model
      data.frame(tidy(model),              # get coefficient info
                 glance(model))})          # get model info


#   id        term     estimate    std.error statistic    p.value r.squared adj.r.squared     sigma statistic.1  p.value.1 df    logLik      AIC      BIC deviance df.residual
# 1  1 (Intercept) -492.2144842 213.19252113 -2.308779 0.04978386 0.3996362    0.32459069 0.9611139    5.325253 0.04987162  2 -12.67704 31.35409 32.26184 7.389919           8
# 2  1        year    0.2479705   0.10745580  2.307651 0.04987162 0.3996362    0.32459069 0.9611139    5.325253 0.04987162  2 -12.67704 31.35409 32.26184 7.389919           8
# 3  2 (Intercept) -258.6253012 196.88284243 -1.313600 0.22539989 0.1771294    0.07427055 0.8871395    1.722063 0.22582607  2 -11.87614 29.75227 30.66003 6.296132           8
# 4  2        year    0.1301582   0.09918521  1.312274 0.22582607 0.1771294    0.07427055 0.8871395    1.722063 0.22582607  2 -11.87614 29.75227 30.66003 6.296132           8

如果您确实希望在最终数据框中使用x和y列,则可以使用group_by(id,x,y).例如,如果您想要与您提供的输出相似的输出,则可以执行以下操作:

library(dplyr)
library(broom)
library(tidyr)

dd = 
    df %>% 
      group_by(id, x, y) %>%
      do({model = lm(response~year, data=.)    # create your model
      data.frame(tidy(model),              # get coefficient info
                 glance(model))})          # get model info

然后选择所需的信息:

 dd %>%
  select(id, x, y, term, estimate, adj.r.squared)

#   id  x   y        term     estimate adj.r.squared
# 1  1 25 100 (Intercept) -492.2144842    0.32459069
# 2  1 25 100        year    0.2479705    0.32459069
# 3  2 30 200 (Intercept) -258.6253012    0.07427055
# 4  2 30 200        year    0.1301582    0.07427055

其中截距为一行,坡度为一行.

或者,甚至将该数据框重塑为:

dd %>%
  select(id, x, y, term, estimate, adj.r.squared) %>%
  spread(term, estimate)

#   id  x   y adj.r.squared (Intercept)      year
# 1  1 25 100    0.32459069   -492.2145 0.2479705
# 2  2 30 200    0.07427055   -258.6253 0.1301582

yearslope(变量year的系数)

以类似的方式,您还可以使用包purrr创建一个新的数据框,该框具有带有相应信息的列表列:

set.seed(25)

df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30),10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

library(dplyr)
library(broom)
library(tidyr)
library(purrr)

dd = df %>%
  group_by(id, x, y) %>%                                  # for each combination of those variables
  nest() %>%                                              # nest data (rest of columns)
  mutate(Model = map(data, ~lm(response~year, data=.)),   # use nested data to build model of interest
         Coeff_Info = map(Model, tidy),                   # get coefficient info
         Model_Info = map(Model, glance)) %>%             # get model info
  ungroup()                                               # forget the grouping

# check how new dataset looks like
dd

# # A tibble: 2 x 7
#        id     x     y              data    Model           Coeff_Info            Model_Info
#     <int> <dbl> <dbl>            <list>   <list>               <list>                <list>
#   1     1    25   100 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
#   2     2    30   200 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>

您仍然可以访问所需的任何元素,但是请记住,某些列现在具有列表元素:

# get coefficient info for both models
dd$Coeff_Info

# [[1]]
#          term     estimate   std.error statistic    p.value
# 1 (Intercept) -492.2144842 213.1925211 -2.308779 0.04978386
# 2        year    0.2479705   0.1074558  2.307651 0.04987162
# 
# [[2]]
#          term     estimate    std.error statistic   p.value
# 1 (Intercept) -258.6253012 196.88284243 -1.313600 0.2253999
# 2        year    0.1301582   0.09918521  1.312274 0.2258261


# get the r squared value for 1st model
dd %>%                    # from new dataset
  filter(id == 1) %>%     # keep all info / rows of 1st model
  pull(Model_Info) %>%    # get model info
  map_dbl("r.squared")    # show r squared

# [1] 0.3996362

或者,或者

dd %>%                        # from new dataset
  unnest(id, Model_Info) %>%  # umnest id and model info
  filter(id == 1) %>%         # keep row of 1st model
  pull(r.squared)             # show the r squared

# [1] 0.3996362

I have a set of spatial coordinate (x,y) data that has a response variable for each coordinate over the course of several years. The following code generates a similar data frame:

df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30),10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

The resulting data frame:

head(df)

 id  x   y year   response
1  1 25 100 1980  0.1707431
2  2 30 200 1981  1.3562263
3  1 25 100 1982 -0.4590506
4  2 30 200 1983  1.3238410
5  1 25 100 1984  1.7765772
6  2 30 200 1985 -0.6258069

I want to run a linear regression on each cell through time to get the change in response variable. To do this, I use plyr and the ddply command:

require(plyr)    
lm.df <- ddply(df, .(id), function(z)coef(lm(response ~ year, data = z)))

Then I recombine w/ the spatial data (the example here of the recombination is a bit simple, but works for the example.):

points <- data.frame(
  id = c(1,2),
  x = c(25,30),
  y = c(100,200)
)

lm.stack <-  merge(points, lm.df, by="id")
colnames(lm.stack) <- c("ID", "x", "y", "intercept", "slope")

print(lm.stack)
ID  x   y intercept       slope
1  1 25 100  257.7291 -0.12985632
2  2 30 200  173.3676 -0.08708068

This works great. But what I want, is to be able to extract the adj r squared value of each lm model to be able to denote significant trends by coordinate cell w/ the ultimate goal of projecting a map of cells colored by significance value. This is where I need help and am greatly appreciative. Thanks!

解决方案

I would suggest a combination of packages dplyr and broom. This approach will return everything you need to know about the model (info about coefficients and info about model itself) as a dataframe:

set.seed(25)

df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30),10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

df

#    id  x   y year    response
# 1   1 25 100 1980 -0.21183360
# 2   2 30 200 1981 -1.04159113
# 3   1 25 100 1982 -1.15330756
# 4   2 30 200 1983  0.32153150
# 5   1 25 100 1984 -1.50012988
# 6   2 30 200 1985 -0.44553326
# 7   1 25 100 1986  1.73404543
# 8   2 30 200 1987  0.51129562
# 9   1 25 100 1988  0.09964504
# 10  2 30 200 1989 -0.05789111
# 11  1 25 100 1980 -1.74278763
# 12  2 30 200 1981 -1.32495298
# 13  1 25 100 1982 -0.54793388
# 14  2 30 200 1983 -1.45638428
# 15  1 25 100 1984  0.08268682
# 16  2 30 200 1985  0.92757895
# 17  1 25 100 1986 -0.71676933
# 18  2 30 200 1987  0.96239968
# 19  1 25 100 1988  1.54588458
# 20  2 30 200 1989 -1.00976361



library(dplyr)
library(broom)


df %>% 
  group_by(id) %>%
  do({model = lm(response~year, data=.)    # create your model
      data.frame(tidy(model),              # get coefficient info
                 glance(model))})          # get model info


#   id        term     estimate    std.error statistic    p.value r.squared adj.r.squared     sigma statistic.1  p.value.1 df    logLik      AIC      BIC deviance df.residual
# 1  1 (Intercept) -492.2144842 213.19252113 -2.308779 0.04978386 0.3996362    0.32459069 0.9611139    5.325253 0.04987162  2 -12.67704 31.35409 32.26184 7.389919           8
# 2  1        year    0.2479705   0.10745580  2.307651 0.04987162 0.3996362    0.32459069 0.9611139    5.325253 0.04987162  2 -12.67704 31.35409 32.26184 7.389919           8
# 3  2 (Intercept) -258.6253012 196.88284243 -1.313600 0.22539989 0.1771294    0.07427055 0.8871395    1.722063 0.22582607  2 -11.87614 29.75227 30.66003 6.296132           8
# 4  2        year    0.1301582   0.09918521  1.312274 0.22582607 0.1771294    0.07427055 0.8871395    1.722063 0.22582607  2 -11.87614 29.75227 30.66003 6.296132           8

You can use group_by(id,x,y) if you really want your x and y columns in your final dataframe. For example if you want a similar output to the one you provided you can do :

library(dplyr)
library(broom)
library(tidyr)

dd = 
    df %>% 
      group_by(id, x, y) %>%
      do({model = lm(response~year, data=.)    # create your model
      data.frame(tidy(model),              # get coefficient info
                 glance(model))})          # get model info

Then select the info you need:

 dd %>%
  select(id, x, y, term, estimate, adj.r.squared)

#   id  x   y        term     estimate adj.r.squared
# 1  1 25 100 (Intercept) -492.2144842    0.32459069
# 2  1 25 100        year    0.2479705    0.32459069
# 3  2 30 200 (Intercept) -258.6253012    0.07427055
# 4  2 30 200        year    0.1301582    0.07427055

where you get one row for the intercept and one row for the slope.

Or, even reshape that data frame to:

dd %>%
  select(id, x, y, term, estimate, adj.r.squared) %>%
  spread(term, estimate)

#   id  x   y adj.r.squared (Intercept)      year
# 1  1 25 100    0.32459069   -492.2145 0.2479705
# 2  2 30 200    0.07427055   -258.6253 0.1301582

column year is the slope (coefficient of variable year)

In a similar way you can also use package purrr to create a new data frame that has list columns with the corresponding info:

set.seed(25)

df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30),10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

library(dplyr)
library(broom)
library(tidyr)
library(purrr)

dd = df %>%
  group_by(id, x, y) %>%                                  # for each combination of those variables
  nest() %>%                                              # nest data (rest of columns)
  mutate(Model = map(data, ~lm(response~year, data=.)),   # use nested data to build model of interest
         Coeff_Info = map(Model, tidy),                   # get coefficient info
         Model_Info = map(Model, glance)) %>%             # get model info
  ungroup()                                               # forget the grouping

# check how new dataset looks like
dd

# # A tibble: 2 x 7
#        id     x     y              data    Model           Coeff_Info            Model_Info
#     <int> <dbl> <dbl>            <list>   <list>               <list>                <list>
#   1     1    25   100 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
#   2     2    30   200 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>

You can still access any element you want, but remember that some of your columns now have list elements:

# get coefficient info for both models
dd$Coeff_Info

# [[1]]
#          term     estimate   std.error statistic    p.value
# 1 (Intercept) -492.2144842 213.1925211 -2.308779 0.04978386
# 2        year    0.2479705   0.1074558  2.307651 0.04987162
# 
# [[2]]
#          term     estimate    std.error statistic   p.value
# 1 (Intercept) -258.6253012 196.88284243 -1.313600 0.2253999
# 2        year    0.1301582   0.09918521  1.312274 0.2258261


# get the r squared value for 1st model
dd %>%                    # from new dataset
  filter(id == 1) %>%     # keep all info / rows of 1st model
  pull(Model_Info) %>%    # get model info
  map_dbl("r.squared")    # show r squared

# [1] 0.3996362

Or, alternatively

dd %>%                        # from new dataset
  unnest(id, Model_Info) %>%  # umnest id and model info
  filter(id == 1) %>%         # keep row of 1st model
  pull(r.squared)             # show the r squared

# [1] 0.3996362

这篇关于使用空间数据从ddply函数内部的多元线性回归(lm)中提取p值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆