Fit a different model for each row of a list-columns data frame
Question
What is the best way to fit different model formulae, varying by row of a data frame, using the list-columns data structure in the tidyverse?
In R for Data Science, Hadley presents a terrific example of how to use the list-columns data structure to fit many models easily (http://r4ds.had.co.nz/many-models.html#gapminder). I am trying to find a way to fit many models with slightly different formulae. In the example below, adapted from his original, what is the best way to fit a different model for each continent?
library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
by_continent <- gapminder %>%
  group_by(continent) %>%
  nest()
by_continent <- by_continent %>%
  mutate(model = map(data, ~lm(lifeExp ~ year, data = .)))
by_continent %>%
  mutate(glance = map(model, glance)) %>%
  unnest(glance, .drop = TRUE)
## A tibble: 5 × 12
# continent r.squared adj.r.squared sigma statistic p.value df
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 Asia 0.4356350 0.4342026 8.9244419 304.1298 6.922751e-51 2
#2 Europe 0.4984659 0.4970649 3.8530964 355.8099 1.344184e-55 2
#3 Africa 0.2987543 0.2976269 7.6685811 264.9929 6.780085e-50 2
#4 Americas 0.4626467 0.4608435 6.8618439 256.5699 4.354220e-42 2
#5 Oceania 0.9540678 0.9519800 0.8317499 456.9671 3.299327e-16 2
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
## deviance <dbl>, df.residual <int>
I know I can do it by iterating through by_continent (not efficient, as it fits each formula to every continent):
formulae <- list(
  Asia     = ~lm(lifeExp ~ year, data = .),
  Europe   = ~lm(lifeExp ~ year + pop, data = .),
  Africa   = ~lm(lifeExp ~ year + gdpPercap, data = .),
  Americas = ~lm(lifeExp ~ year - 1, data = .),
  Oceania  = ~lm(lifeExp ~ year + pop + gdpPercap, data = .)
)
# Fits formula i to every continent's data, then keeps only the i-th fit
for (i in seq_len(nrow(by_continent))) {
  by_continent$model[[i]] <- map(by_continent$data, formulae[[i]])[[i]]
}
by_continent %>%
  mutate(glance = map(model, glance)) %>%
  unnest(glance, .drop = TRUE)
## A tibble: 5 × 12
# continent r.squared adj.r.squared sigma statistic p.value df
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 Asia 0.4356350 0.4342026 8.9244419 304.1298 6.922751e-51 2
#2 Europe 0.4984677 0.4956580 3.8584819 177.4093 3.186760e-54 3
#3 Africa 0.4160797 0.4141991 7.0033542 221.2506 2.836552e-73 3
#4 Americas 0.9812082 0.9811453 8.9703814 15612.1901 4.227928e-260 1
#5 Oceania 0.9733268 0.9693258 0.6647653 243.2719 6.662577e-16 4
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
## deviance <dbl>, df.residual <int>
But is it possible to do this without falling back to a loop in base R (and without fitting models I don't need)?
What I tried is something like this:
by_continent <- by_continent %>%
  left_join(tibble::enframe(formulae, name = "continent", value = "formula"))
by_continent %>%
  mutate(model = map2(data, formula, est_model))
But I don't seem to be able to come up with an est_model function that works. I tried the following function (h/t: https://gist.github.com/multidis/8138757), which doesn't work:
est_model <- function(data, formula, ...) {
  mc <- match.call()
  m <- match(c("formula", "data"), names(mc), 0L)
  mf <- mc[c(1L, m)]
  mf[[1L]] <- as.name("model.frame")
  mf <- eval(mf, parent.frame())
  data.st <- data.frame(mf)
  return(data.st)
}
(Admittedly, this is a contrived example. In my actual case, a substantial number of observations are missing key independent variables, so I want to fit one model with all variables on the complete observations and another with only a subset of the variables on the remaining observations.)
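The use case described in the parenthetical can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the built-in airquality data set (not the asker's data), treats rows with a missing Solar.R as the "incomplete" group, and the names group, formulae, and by_group are hypothetical. It also uses the nest(data = -group) syntax, which assumes tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One formula per group: the full model where Solar.R is observed,
# a reduced model (without Solar.R) where it is missing.
formulae <- list(
  complete = Ozone ~ Solar.R + Wind + Temp,
  partial  = Ozone ~ Wind + Temp
)

by_group <- airquality %>%
  # Label each row by whether the key predictor is available (illustrative rule)
  mutate(group = if_else(is.na(Solar.R), "partial", "complete")) %>%
  nest(data = -group) %>%
  mutate(
    # Look the formula up by group name so order does not matter
    formula = unname(formulae[group]),
    # Each formula is fit exactly once, on its own subset of rows
    model   = map2(formula, data, ~ lm(.x, data = .y))
  )
```

Note that lm() still drops rows with a missing response by default, so the reduced model here rests on very few observations; the point is the pairing of formulae with subsets, not the fit itself.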
UPDATE
I came up with an est_model function that works (though it is probably not efficient):
est_model <- function(data, formula, ...) {
  map(list(data), formula, ...)[[1]]
}
by_continent <- by_continent %>%
  mutate(model = map2(data, formula, est_model))
by_continent %>%
  mutate(glance = map(model, glance)) %>%
  unnest(glance, .drop = TRUE)
## A tibble: 5 × 12
# continent r.squared adj.r.squared sigma statistic p.value df
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 Asia 0.4356350 0.4342026 8.9244419 304.1298 6.922751e-51 2
#2 Europe 0.4984677 0.4956580 3.8584819 177.4093 3.186760e-54 3
#3 Africa 0.4160797 0.4141991 7.0033542 221.2506 2.836552e-73 3
#4 Americas 0.9812082 0.9811453 8.9703814 15612.1901 4.227928e-260 1
#5 Oceania 0.9733268 0.9693258 0.6647653 243.2719 6.662577e-16 4
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## df.residual <int>
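The est_model above works because map() turns a one-sided formula such as ~lm(lifeExp ~ year, data = .) into a function. A minimal sketch of the same idea without wrapping the data in a list, assuming purrr's as_mapper() and using the built-in mtcars data purely for illustration:

```r
library(purrr)

est_model <- function(data, formula, ...) {
  # as_mapper() converts the one-sided formula into a function of `.`
  fit <- as_mapper(formula)
  fit(data, ...)
}

# Illustrative call on a built-in data set
model <- est_model(mtcars, ~ lm(mpg ~ wt, data = .))
```

This calls the lambda once on the data it is given, which is what map(list(data), formula)[[1]] does indirectly.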
Recommended answer
I find it easier to make a list of model formulae, so that each model is fit only once, for its corresponding continent. I add a new column, formula, to the nested data and look each formula up by continent name, so that the formulae and the continents stay matched even if they are not listed in the same order.
# c() on formula objects returns a named list of formulae
formulae <- c(
  Asia     = lifeExp ~ year,
  Europe   = lifeExp ~ year + pop,
  Africa   = lifeExp ~ year + gdpPercap,
  Americas = lifeExp ~ year - 1,
  Oceania  = lifeExp ~ year + pop + gdpPercap
)
df <- gapminder %>%
  group_by(continent) %>%
  nest() %>%
  # Match each formula to its continent by name
  mutate(formula = formulae[as.character(continent)]) %>%
  # .x is the formula, .y is that continent's data
  mutate(model = map2(formula, data, ~ lm(.x, .y))) %>%
  mutate(glance = map(model, glance)) %>%
  unnest(glance, .drop = TRUE)
# # A tibble: 5 × 12
# continent r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 Asia 0.4356350 0.4342026 8.9244419 304.1298 6.922751e-51 2 -1427.65947 2861.31893 2873.26317
# 2 Europe 0.4984677 0.4956580 3.8584819 177.4093 3.186760e-54 3 -995.41016 1998.82033 2014.36475
# 3 Africa 0.4160797 0.4141991 7.0033542 221.2506 2.836552e-73 3 -2098.46089 4204.92179 4222.66639
# 4 Americas 0.9812082 0.9811453 8.9703814 15612.1901 4.227928e-260 1 -1083.35918 2170.71836 2178.12593
# 5 Oceania 0.9733268 0.9693258 0.6647653 243.2719 6.662577e-16 4 -22.06696 54.13392 60.02419
# # ... with 2 more variables: deviance <dbl>, df.residual <int>
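The output above comes from older package versions. A sketch of the same approach for more recent releases, assuming tidyr 1.0 or later (where nest(data = -continent) is available and the .drop argument of unnest() is deprecated); the names fits and stats are illustrative:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(gapminder)

formulae <- list(
  Asia     = lifeExp ~ year,
  Europe   = lifeExp ~ year + pop,
  Africa   = lifeExp ~ year + gdpPercap,
  Americas = lifeExp ~ year - 1,
  Oceania  = lifeExp ~ year + pop + gdpPercap
)

fits <- gapminder %>%
  nest(data = -continent) %>%
  mutate(
    # Look up by name so the formulae and continents cannot get out of order
    formula = unname(formulae[as.character(continent)]),
    # One fit per row: the formula paired with that continent's data
    model   = map2(formula, data, ~ lm(.x, data = .y)),
    stats   = map(model, broom::glance)
  )

# Keep only the continent and the one-row model summaries
fits %>%
  select(continent, stats) %>%
  unnest(stats)
```

Dropping the list-columns with select() before unnest() plays the role that .drop = TRUE played in the original code.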