How can I apply grouped data to grouped models using broom and dplyr?

Question

I'd like to do the equivalent of fitting a model of gpm (gallons per mile = 1/mpg) to wt in the mtcars data set. That seems easy:

data(mtcars)
library(dplyr)
library(tidyr)
library(broom)
library(ggplot2)
library(scales)

mtcars2 <-
    mtcars %>%
    mutate(gpm = 1 / mpg) %>%
    group_by(cyl, am)

lm1 <-
    mtcars2 %>%
    do(fit = lm(gpm ~ wt, data = .))

That gets me a rowwise data frame with 6 rows, as expected.

This plot confirms that there are six groups:

p1 <-
    qplot(wt, gpm, data = mtcars2) +
    facet_grid(cyl ~ am) +
    stat_smooth(method='lm',se=FALSE, fullrange = TRUE) +
    scale_x_continuous(limits = c(0,NA)) 

I can use augment() to get the fitted outputs:

lm1 %>% augment(fit)

That gives me 32 rows, one for each row in mtcars2, as expected.

Now the challenge: I'd like to get fitted outputs using newdata, where I've incremented wt by cyl/4:

newdata <-
    mtcars2 %>%
    mutate(
        wt = wt + cyl/4)

I expect that this will produce a data frame of the same size as lm1 %>% augment(fit): one row for each row in newdata, because broom will match up models and newdata by the grouping variables cyl and am.

Unfortunately,

pred1 <-
    lm1 %>%
    augment(
        fit,
        newdata = newdata)

gives me a data frame with 192 rows (= 6 x 32), apparently fitting each model to each row of newdata.

From reading elsewhere, I gather that group_by and rowwise data frames aren't compatible, so lm1 is ungrouped, and augment can't associate models and newdata. Is there another design pattern that lets me do this? It would be nice if it were as simple and transparent as the above attempt, but it's more important that it work.

Here's my sessionInfo():

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] scales_0.4.0  ggplot2_2.1.0 broom_0.4.1   tidyr_0.6.0   dplyr_0.5.0  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      magrittr_1.5     mnormt_1.5-4     munsell_0.4.3   
 [5] colorspace_1.2-6 lattice_0.20-34  R6_2.1.3         stringr_1.1.0   
 [9] plyr_1.8.4       tools_3.3.1      parallel_3.3.1   grid_3.3.1      
[13] nlme_3.1-128     gtable_0.2.0     psych_1.6.9      DBI_0.5-1       
[17] lazyeval_0.2.0   assertthat_0.1   tibble_1.2       reshape2_1.4.1  
[21] labeling_0.3     stringi_1.1.1    compiler_3.3.1   foreign_0.8-67  

Edit:

@aosmith: I have been exploring your second option, and I like it. When I try it on my real data, though, I have a problem in the mutate command: it returns "Error: augment doesn't know how to deal with data of class list".

My real code looks more like this:

newdata %>% 
dplyr::select(cyl, am, wt) %>% # wt holds new predictor values
group_by(cyl, am) %>%
nest() %>%
inner_join(regressions, .) %>% 
## looks like yours at this point
mutate(pred = list(augment(fit, newdata = data))) %>% # Error here
unnest(pred)

Where I say it looks like yours, I mean I have the following columns (renamed here for consistency): ID (chr), attr1 (dbl), cyl (dbl), am (chr), fit (list), and data (list). You have cyl, am (dbl), fit, and data. I changed my am to dbl, but that didn't help.

I think the difference is that I have 3 (ID ... similar to the rownames in mtcars) x 2 (cyl) x 2 (am) units in this sample (with each sample having 12 measurements), while the mtcars example has 3 (cyl) x 2 (am) cells x a random number of car types per cell. In my analysis, I need to see the ID values, but newdata applies equally to all units. If it helps, think of it as the speed of a headwind applied to each car in the test. Does that suggest a cause for augment's complaint it can't deal with data of class list?

Edit: Merging the ID with the newdata (using full=TRUE) solved the last problem. I'm currently using your first proposed solution.

Recommended answer

I've used map2 from package purrr for this sort of situation. map2 loops through the elements of two lists simultaneously. The lists must be the same length and be in the same order.

The elements of the lists are used as arguments for some function you want to apply (augment, in your case). Here your two lists would be a list of models and a list of datasets (one dataset for each cyl/am combination).

Using map2_df returns the results as a data.frame instead of a list.

library(purrr)

I made the list of data.frames to predict with using split. The order of the factors to split on determined the list order, so I made sure it was in the same order as lm1.

test_split = split(newdata, list(newdata$am, newdata$cyl))

map2_df(lm1$fit, test_split, ~augment(.x, newdata = .y))
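Because map2 pairs elements purely by position, it can help to check the alignment explicitly rather than rely on the split order. The following is a minimal, self-contained sketch of that idea, not the code from the question: it assumes the models are fitted with map over a split list instead of do(), and the names train_split, fits, and pred are made up for illustration. Splitting the training data and the prediction data with the same call gives both lists the same names, which can then be compared before predicting.

```r
library(dplyr)
library(purrr)
library(broom)

mtcars2 <- mtcars %>% mutate(gpm = 1 / mpg)
newdata <- mtcars2 %>% mutate(wt = wt + cyl / 4)

# Split training and prediction data with the same call so that both
# lists share the same names ("0.4", "1.4", ...) and the same order.
train_split <- split(mtcars2, list(mtcars2$am, mtcars2$cyl))
test_split  <- split(newdata, list(newdata$am, newdata$cyl))

# One model per cyl/am group (assumption: map() replaces do() here).
fits <- map(train_split, ~ lm(gpm ~ wt, data = .x))

# Fail early if the two lists are not aligned group by group.
stopifnot(identical(names(fits), names(test_split)))

# Each model is applied only to its own group's rows.
pred <- map2_dfr(fits, test_split, ~ augment(.x, newdata = .y))
nrow(pred)
```

If the name check passes, each model only ever sees the new data for its own cyl/am group, so the result should have one row per row of newdata rather than one per model and row combination.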

To avoid worrying about order so much, you could nest the prediction data by groups, join this to lm1, and return the results of augment as a list for unnesting.

newdata %>%
    group_by(cyl, am) %>%
    nest() %>%
    inner_join(lm1, .) %>%
    mutate(pred = list(augment(fit, newdata = data))) %>%
    unnest(pred)
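The pipeline above depends on lm1 still being a rowwise data frame when mutate runs, so that fit and data refer to a single model and a single data frame. If the rowwise property is lost along the way (for example after a join), augment may instead receive the whole list column, which is one plausible source of the "doesn't know how to deal with data of class list" error mentioned in the question's edit. A hedged sketch of a variant that does not rely on rowwise behaviour is shown below. It assumes a recent tidyr (1.0 or later, for the nest(name = -cols) syntax) and uses map2 inside mutate; the names fits and pred2 are illustrative only.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mtcars2 <- mtcars %>% mutate(gpm = 1 / mpg)
newdata <- mtcars2 %>% mutate(wt = wt + cyl / 4)

# One row per cyl/am group, with the fitted model in a list column.
fits <- mtcars2 %>%
    nest(train = -c(cyl, am)) %>%
    mutate(fit = map(train, ~ lm(gpm ~ wt, data = .x))) %>%
    select(cyl, am, fit)

pred2 <- newdata %>%
    nest(data = -c(cyl, am)) %>%
    # Join on the grouping variables, so list order no longer matters.
    inner_join(fits, by = c("cyl", "am")) %>%
    # map2() passes augment() one model and one data frame at a time,
    # never the whole list columns.
    mutate(pred = map2(fit, data, ~ augment(.x, newdata = .y))) %>%
    select(cyl, am, pred) %>%
    unnest(pred)

nrow(pred2)
```

Matching happens in the join on cyl and am, so this version should not be sensitive to the order of the groups in either table.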
