关于dplyr :: do vs purrr用法的建议:地图,整洁的::巢,用于预测 [英] advice on Usage of dplyr:: do vs purrr: map, tidy::nest, for predictions
问题描述
我刚遇到了purrr软件包,我认为这对我想做的事情有所帮助-我只是不能将它们放在一起。
我认为这将是后期的工作,但会讨论一个常见的用例,我认为还有很多其他用例,因此希望这对他们也有用。
这就是我的目标:
- 从一个大数据集中,对每个不同的子组运行多个模型。
- 这些模型随时可用,因此我可以检查-系数,准确性等。
- 从此保存的模型列表中为每个不同的组,能够将相应的模型应用于相应的测试集组。
grouping_vals = c( cyl, vs)
库(purrr)
库(dplyr)
set.seed(1)
train = mtcars
noise = sample(1:5,32,replace = TRUE)
test = mtcars%>%mutate(hp = hp * noise)#只是数据集不完全相同
模型=火车%&%;%
group_by_(grouping_vals)%>%
do(linear_model1 = lm(mpg〜hp,数据=。),
linear_model2 = lm(mpg〜。,data =。)
)
- 我已经走了这么远,但我不知道如何将对应的模型映射到 测试
- 现在,我可能还试图通过使用针对相应组的训练数据对linear_model1或linear_model2进行训练来获得残差。
models $ linear_model1 [[2]] $ residuals将为我显示model1第二分组的残差。我只是不知道如何将所有模型$ linear_model1 $ residuals转移到火车数据集中。
我的理解是,tidyr的nest()函数在做同样的事情当我创建模型的do()创建时发生的事情。
models_with_nest = train%>%
group_by_(grouping_vals)%>%
nest()%>%
mutate(linear_model2 = purrr :: map(data,〜lm(mpg〜。,data =。)),
linear_model1 = purrr :: map(data,〜lm(mpg〜hp + disp,data =。))
)
再次寻找一种方法,可以轻松地将这些残差/训练预测映射到训练数据集,然后将其应用到一个看不见的测试数据集,如我上面创建的那样。
我希望这不会造成混淆,因为我在这里看到了很多希望,但我只是想不出如何将其组合在一起。
我认为这是很多人希望能够以更自动化的方式完成的任务,而是人们做得很慢并且
我真的很想找出 do $之间的区别c $ c>和
嵌套,地图
方法。也许人们都尝试了这两种方法,他们可以在处理更大的数据集或更多模型时发表评论。
到目前为止,我一直在使用 do
方法如下:
library(tidyverse)
#可再现的结果
set.seed(47)
#随机播放/随机行
mtcars2 = mtcars%&%;%sample_frac(1)
#分割火车/测试
mtcars_train = mtcars2 [1:20,]
mtcars_test = mtcars2 [21:32,]
#为每个圆柱组创建子集并拟合使用do
dt_models = mtcars_train%>%
group_by(cyl)%>%
do(model1 = lm(disp〜hp,data =。),
model2 = lm(disp〜mpg,data =。))%>%
取消分组%>%
print()
#重塑模型数据集(以便于使用稍后)
dt_models = dt_models%&%;%collect( name, model,-cyl)%&%;%print()
#选择模型并预测相应数据的函数(行)
GetMode lAndPredict = function(input_cyl,model_name,dd){
m =(dt_models%>%filter(cyl == input_cyl& name == model_name))$ model [[1]]
预报.lm(m,newdata = dd)
}
#预报每行使用相应的模型
mtcars_test%&%;%
rowwise()%&%;%
do(data.frame(。,
pred1 = GetModelAndPredict(。$ cyl, model1,。),
pred2 = GetModelAndPredict(。$ cyl, model2,。)))%>%
取消分组
##小食:12×13
#mpg cyl disp hp drat wt qsec vs am gear碳水化合物pred1 pred2
#*< dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl>
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 103.11501 115.24903
#2 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 356.19839 316.20091
#3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 200.10912 151.56750
#4 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 195.69767 198.89904
#5 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 87.99347 77.54320
# 6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 101.99490 102.68042
#7 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 365.97745 339.57501
#8 24.4 4 146.7 62 3.69 3.190 20.00 10.00 4 2 85.75324 108.96473
#9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 87.99347 97.57442
#10 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 87.43341 71.65166
#11 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 104.23513 115.24903
#12 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 355.61630 294.38507
但是我发现嵌套地图
的方法也非常有趣:
library(tidyverse)
#可再现的结果
set.seed(47)
#随机播放/随机行
mtcars2 = mtcars%>%sample_frac(1)
#分割火车/测试
mtcars_train = mtcars2 [1:20,]
mtcars_test = mtcars2 [21:32,] $ b $ c
$ b#为每个cyl组创建子集并使用映射
拟合感兴趣的模型
dt_models = mtcars_train%>%
nest(-cyl)%>%
突变(model1 = map(data,〜lm(disp〜hp,data =。)),
model2 = map(data,〜lm(disp〜mpg,data =。)))%>%
重命名(data_train =数据)%>%
print()
#加入测试数据以能够预测它们
dt_models_and_test_data = mtcars_test%>%
巢(-cyl) %>%
inner_join(dt_models,by = cyl)%>%
重命名(data_test =数据)%>%
print()
#使用map2预测测试数据
dt_preds = dt_models_and_test_data%&%;%
mutate(pred1 = map2(model1,data_test,Forecast.lm),
pred2 = map2(model2,data_test,Forecast .lm))%>%
print()
#使用嵌套在感兴趣列
上返回合理的数据帧
dt_preds_upd = dt_preds%>%
unnest(data_test,pred1,pred2)%&%;%
print()
##动作:12×13
#cyl pred1 pred2 mpg disp hp drat wt qsec vs am gear碳水化合物
#< dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl> < dbl>
#1 4 103.11501 115.24903 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 4 87.99347 77.54320 32.4 78.7 66 4.08 2.200 19.47 1 1 4 1
#3 4 101.99490 102.68042 26.0 120.3 91 4.43 2.140 16.70 0 1 5 2
#4 4 85.75324 108.96473 24.4 146.7 62 3.69 3.190 20.00 1 0 4 2
#5 4 87.99347 97.57442 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
# 6 4 87.43341 71.65166 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1
#7 4 104.23513 115.24903 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2
#8 8 356.19839 316.20091 17.3 275.8 180 3.07 3.730 17.60 0 0 3 3
#9 8 365.97745 339.57501 15.8 351.0 264 4.22 3.170 14.50 0 1 5 4
#10 8 355.61630 294.38507 18.7 360.0 175 3.15 3.440 17.02 0 0 3 2
#11 6 200.10912 151.56750 18 。 1 225.0 105 2.76 3.460 20.22 1 0 3 1
#12 6 195.69767 198.89904 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4
I just came across the the purrr package and I think this would help me out a bit in terms of what I want to do - I just can't put it together.
I think this is going to be along post but goes over a common use case I think many others run into so hopefully this is of use to them as well.
This is what I'm aiming for:
- From one big dataset run multiple models on each of the different subgroups.
- Have these models readily available so I can examine - for coeffients, accuracy, etc.
- From this saved model list for each of the different groupings, be able to apply the corresponding model to the corresponding test-set group.
grouping_vals = c("cyl", "vs") library(purrr) library(dplyr) set.seed(1) train=mtcars noise = sample(1:5,32, replace=TRUE) test = mtcars %>% mutate( hp = hp * noise) # just so dataset isn't identical models = train %>% group_by_(grouping_vals) %>% do(linear_model1 = lm(mpg ~hp, data=.), linear_model2 = lm(mpg ~., data=.) )
- I've gotten this far but I don't know how to 'map' the corresponding models to the "test" dataset for the corresponding grouped values.
- Now I also might be trying to get the residuals from the training of the linear_model1 or linear_model2 with the training-data for the corresponding groups.
models$linear_model1[[2]]$residuals will show me the residuals for the 2nd grouping of model1. I just don't know how move say all of models$linear_model1 $residuals over to the train dataset.
My understanding is that tidyr's nest() function is doing the same thing that occurs when I create my do() create of the models.
models_with_nest = train %>%
group_by_(grouping_vals) %>%
nest() %>%
mutate( linear_model2 = purrr::map(data, ~lm(mpg~., data=.)),
linear_model1 = purrr::map(data, ~lm(mpg~ hp+disp, data=.))
)
Again just look for a way to easily be able to 'map' these residuals/training predictions to the training dataset and apply then apply the corresponding model to an unseen test dataset like the one I created above.
I hope this isn't confusing since I see a lot of promise here I just can't figure out how to put it together.
I figure this is a task that a ton of people would like to be able to do in this more 'automated' way but instead is something that people do very slowly and step by step.
I'm really interested in finding out differences between the do
and the nest, map
approaches. Maybe people have tried both and they can comment in which is faster when dealing with much bigger datasets, or much more models.
So far I've been using the do
approach as follows:
library(tidyverse)
# reproducible results
set.seed(47)
# shuffle / randomise rows
mtcars2 = mtcars %>% sample_frac(1)
# split train / test
mtcars_train = mtcars2[1:20,]
mtcars_test = mtcars2[21:32,]
# for each cyl group create subsets and fit the models of interest using do
dt_models = mtcars_train %>%
group_by(cyl) %>%
do(model1 = lm(disp ~ hp, data = .),
model2 = lm(disp ~ mpg, data = .)) %>%
ungroup %>%
print()
# reshape model dataset (for easier use later)
dt_models = dt_models %>% gather("name","model", -cyl) %>% print()
# function to pick model and predict corresponding data (row)
GetModelAndPredict = function(input_cyl, model_name, dd){
m = (dt_models %>% filter(cyl==input_cyl & name==model_name))$model[[1]]
predict.lm(m, newdata=dd)
}
# predict each row using the corresponding model
mtcars_test %>%
rowwise() %>%
do(data.frame(.,
pred1 = GetModelAndPredict(.$cyl, "model1", .),
pred2 = GetModelAndPredict(.$cyl, "model2", .))) %>%
ungroup
# # A tibble: 12 × 13
# mpg cyl disp hp drat wt qsec vs am gear carb pred1 pred2
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 103.11501 115.24903
# 2 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 356.19839 316.20091
# 3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 200.10912 151.56750
# 4 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 195.69767 198.89904
# 5 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 87.99347 77.54320
# 6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 101.99490 102.68042
# 7 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 365.97745 339.57501
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 85.75324 108.96473
# 9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 87.99347 97.57442
# 10 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 87.43341 71.65166
# 11 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 104.23513 115.24903
# 12 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 355.61630 294.38507
But I found really interesting the nest, map
approach as well:
library(tidyverse)
# reproducible results
set.seed(47)
# shuffle / randomise rows
mtcars2 = mtcars %>% sample_frac(1)
# split train / test
mtcars_train = mtcars2[1:20,]
mtcars_test = mtcars2[21:32,]
# for each cyl group create subsets and fit the models of interest using map
dt_models = mtcars_train %>%
nest(-cyl) %>%
mutate(model1 = map(data, ~lm(disp ~ hp, data = .)),
model2 = map(data, ~lm(disp ~ mpg, data = .))) %>%
rename(data_train = data) %>%
print()
# join test data to be able to predict them
dt_models_and_test_data = mtcars_test %>%
nest(-cyl) %>%
inner_join(dt_models, by = "cyl") %>%
rename(data_test = data) %>%
print()
# predict test data using map2
dt_preds = dt_models_and_test_data %>%
mutate(pred1 = map2(model1, data_test, predict.lm),
pred2 = map2(model2, data_test, predict.lm)) %>%
print()
# go back to a reasonable data frame using unnest on columns of interest
dt_preds_upd = dt_preds %>%
unnest(data_test,pred1,pred2) %>%
print()
# # A tibble: 12 × 13
# cyl pred1 pred2 mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 103.11501 115.24903 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1
# 2 4 87.99347 77.54320 32.4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 3 4 101.99490 102.68042 26.0 120.3 91 4.43 2.140 16.70 0 1 5 2
# 4 4 85.75324 108.96473 24.4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 5 4 87.99347 97.57442 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
# 6 4 87.43341 71.65166 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1
# 7 4 104.23513 115.24903 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2
# 8 8 356.19839 316.20091 17.3 275.8 180 3.07 3.730 17.60 0 0 3 3
# 9 8 365.97745 339.57501 15.8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 10 8 355.61630 294.38507 18.7 360.0 175 3.15 3.440 17.02 0 0 3 2
# 11 6 200.10912 151.56750 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1
# 12 6 195.69767 198.89904 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4
这篇关于关于dplyr :: do vs purrr用法的建议:地图,整洁的::巢,用于预测的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!