Advice on usage of dplyr::do vs purrr::map, tidyr::nest, for predictions


Problem description

I just came across the purrr package, and I think it would help with what I want to do, but I can't quite put it together.

I think this is going to be a long post, but it covers a common use case that I expect many others run into, so hopefully it is useful to them as well.

This is what I'm aiming for:

  1. From one big dataset, run multiple models on each of the different subgroups.
  2. Have these models readily available so I can examine them for coefficients, accuracy, etc.
  3. From this saved list of models for each grouping, be able to apply the corresponding model to the corresponding test-set group.

grouping_vals = c("cyl", "vs")

library(purrr)
library(dplyr)
set.seed(1)
train=mtcars
noise = sample(1:5,32, replace=TRUE) 
test = mtcars %>% mutate( hp = hp * noise) # just so dataset isn't identical


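models = train %>%
  # group_by_() is deprecated; across(all_of()) groups by the column names stored in grouping_vals
  group_by(across(all_of(grouping_vals))) %>%
  do(linear_model1 = lm(mpg ~ hp, data = .),
     linear_model2 = lm(mpg ~ ., data = .))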

  1. I've gotten this far, but I don't know how to 'map' each model to the rows of the "test" dataset that belong to the same group.
  2. I may also want the residuals of linear_model1 or linear_model2 on the training data for the corresponding groups.

models$linear_model1[[2]]$residuals shows me the residuals for the second group of linear_model1. I just don't know how to move, say, all of the models$linear_model1 residuals over to the train dataset.

My understanding is that tidyr's nest() function does the same thing that happens when I create the models with do().

library(tidyr)  # provides nest()

models_with_nest = train %>%
  group_by(across(all_of(grouping_vals))) %>%
  nest() %>%
  mutate(linear_model2 = purrr::map(data, ~ lm(mpg ~ ., data = .)),
         linear_model1 = purrr::map(data, ~ lm(mpg ~ hp + disp, data = .)))
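To make the comparison between do() and nest() concrete, here is a minimal sketch of what each one returns. It groups by cyl only to keep the output small, and the names by_do and by_nest are illustrative, not from the original post.

```r
library(dplyr)
library(tidyr)
library(purrr)

# do() returns one row per group, with each fitted model in a list-column
by_do <- mtcars %>%
  group_by(cyl) %>%
  do(linear_model1 = lm(mpg ~ hp, data = .)) %>%
  ungroup()

# nest() returns one row per group, with the group's data in a list-column;
# the models are added afterwards with map()
by_nest <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%
  mutate(linear_model1 = map(data, ~ lm(mpg ~ hp, data = .x)))

names(by_do)    # cyl, linear_model1
names(by_nest)  # cyl, data, linear_model1
```

The practical difference in this sketch is that the nested version keeps each group's data next to its model, which is convenient when residuals or predictions need to be attached to rows later.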

Again, I'm just looking for a way to easily 'map' these residuals and training predictions back to the training dataset, and then apply the corresponding model to an unseen test dataset like the one I created above.

I hope this isn't confusing. I see a lot of promise here; I just can't figure out how to put it together.

I figure this is a task that a lot of people would like to do in this more 'automated' way, but that in practice people do very slowly, step by step.

Solution

I'm really interested in finding out the differences between the do approach and the nest/map approach. Maybe people who have tried both can comment on which is faster when dealing with much bigger datasets or many more models.
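One rough way to check the speed question on your own data is a timing sketch like the one below. It is only an illustration under stated assumptions: n_copies, big, fit_with_do and fit_with_nest are hypothetical names, the enlarged dataset is just mtcars stacked on itself, and the elapsed times will vary with the machine, the package versions, and the data.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Enlarge mtcars by repeating its rows; n_copies is an arbitrary illustrative value
n_copies <- 200
big <- mtcars[rep(seq_len(nrow(mtcars)), times = n_copies), ]

# Per-group fit with do()
fit_with_do <- function(df) {
  df %>%
    group_by(cyl) %>%
    do(model1 = lm(disp ~ hp, data = .)) %>%
    ungroup()
}

# Per-group fit with nest() + map()
fit_with_nest <- function(df) {
  df %>%
    group_by(cyl) %>%
    nest() %>%
    ungroup() %>%
    mutate(model1 = map(data, ~ lm(disp ~ hp, data = .x)))
}

# Compare elapsed times; repeat several times before drawing conclusions
system.time(res_do <- fit_with_do(big))
system.time(res_nest <- fit_with_nest(big))
```

With only a handful of groups, most of the time is likely spent inside lm() itself, so any difference between the two wrappers may only show up with many groups or many models.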

So far I've been using the do approach as follows:

library(tidyverse)

# reproducible results
set.seed(47)

# shuffle / randomise rows
mtcars2 = mtcars %>% sample_frac(1)

# split train / test
mtcars_train = mtcars2[1:20,]
mtcars_test = mtcars2[21:32,]

# for each cyl group create subsets and fit the models of interest using do
dt_models = mtcars_train %>%
  group_by(cyl) %>%
  do(model1 = lm(disp ~ hp, data = .),
     model2 = lm(disp ~ mpg, data = .)) %>%
  ungroup() %>%
  print()

# reshape model dataset (for easier use later)
dt_models = dt_models %>% gather("name","model", -cyl) %>% print()

# function to pick model and predict corresponding data (row)
GetModelAndPredict = function(input_cyl, model_name, dd){

  m = (dt_models %>% filter(cyl==input_cyl & name==model_name))$model[[1]]

  predict.lm(m, newdata=dd)

}

# predict each row using the corresponding model
mtcars_test %>%
  rowwise() %>%
  do(data.frame(.,
                pred1 = GetModelAndPredict(.$cyl, "model1", .),
                pred2 = GetModelAndPredict(.$cyl, "model2", .))) %>%
  ungroup()


# # A tibble: 12 × 13
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     pred1     pred2
# *  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
# 1   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 103.11501 115.24903
# 2   17.3     8 275.8   180  3.07 3.730 17.60     0     0     3     3 356.19839 316.20091
# 3   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 200.10912 151.56750
# 4   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 195.69767 198.89904
# 5   32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1  87.99347  77.54320
# 6   26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2 101.99490 102.68042
# 7   15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4 365.97745 339.57501
# 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2  85.75324 108.96473
# 9   27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1  87.99347  97.57442
# 10  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1  87.43341  71.65166
# 11  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 104.23513 115.24903
# 12  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 355.61630 294.38507
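Because the fitted models sit in a list-column, they stay available for inspection, which was the second goal in the question. The following is a minimal sketch of pulling a coefficient and R-squared out of each model. It fits on all of mtcars rather than the shuffled training split so that it does not depend on the random seed, and models, model_summary, slope_hp and r_squared are illustrative names.

```r
library(dplyr)
library(purrr)

# One model per cyl group, stored in a list-column
models <- mtcars %>%
  group_by(cyl) %>%
  do(model1 = lm(disp ~ hp, data = .)) %>%
  ungroup()

# Extract one number per model with map_dbl()
model_summary <- models %>%
  mutate(slope_hp  = map_dbl(model1, ~ coef(.x)[["hp"]]),
         r_squared = map_dbl(model1, ~ summary(.x)$r.squared)) %>%
  select(cyl, slope_hp, r_squared)

print(model_summary)
```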

But I also found the nest/map approach really interesting:

library(tidyverse)

# reproducible results
set.seed(47)

# shuffle / randomise rows
mtcars2 = mtcars %>% sample_frac(1)

# split train / test
mtcars_train = mtcars2[1:20,]
mtcars_test = mtcars2[21:32,]

# for each cyl group create subsets and fit the models of interest using map
# (nest(-cyl) is tidyr < 1.0.0 syntax; newer versions expect nest(data = -cyl))
dt_models = mtcars_train %>%
  nest(-cyl) %>%
  mutate(model1 = map(data, ~lm(disp ~ hp, data = .)),
         model2 = map(data, ~lm(disp ~ mpg, data = .))) %>%
  rename(data_train = data) %>%
  print()

# join test data to be able to predict them
dt_models_and_test_data = mtcars_test %>%
  nest(-cyl) %>%
  inner_join(dt_models, by = "cyl") %>%
  rename(data_test = data) %>%
  print()

# predict test data using map2
dt_preds = dt_models_and_test_data %>%
  mutate(pred1 = map2(model1, data_test, predict.lm),
         pred2 = map2(model2, data_test, predict.lm)) %>%
  print()

# go back to a regular data frame by unnesting the columns of interest
# (tidyr < 1.0.0 syntax; newer versions expect unnest(c(data_test, pred1, pred2)))
dt_preds_upd = dt_preds %>%
  unnest(data_test, pred1, pred2) %>%
  print()


# # A tibble: 12 × 13
#      cyl     pred1     pred2   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl>     <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1      4 103.11501 115.24903  22.8 108.0    93  3.85 2.320 18.61     1     1     4     1
# 2      4  87.99347  77.54320  32.4  78.7    66  4.08 2.200 19.47     1     1     4     1
# 3      4 101.99490 102.68042  26.0 120.3    91  4.43 2.140 16.70     0     1     5     2
# 4      4  85.75324 108.96473  24.4 146.7    62  3.69 3.190 20.00     1     0     4     2
# 5      4  87.99347  97.57442  27.3  79.0    66  4.08 1.935 18.90     1     1     4     1
# 6      4  87.43341  71.65166  33.9  71.1    65  4.22 1.835 19.90     1     1     4     1
# 7      4 104.23513 115.24903  22.8 140.8    95  3.92 3.150 22.90     1     0     4     2
# 8      8 356.19839 316.20091  17.3 275.8   180  3.07 3.730 17.60     0     0     3     3
# 9      8 365.97745 339.57501  15.8 351.0   264  4.22 3.170 14.50     0     1     5     4
# 10     8 355.61630 294.38507  18.7 360.0   175  3.15 3.440 17.02     0     0     3     2
# 11     6 200.10912 151.56750  18.1 225.0   105  2.76 3.460 20.22     1     0     3     1
# 12     6 195.69767 198.89904  21.0 160.0   110  3.90 2.875 17.02     0     1     4     4
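The question also asked how to get training residuals back onto the training rows. Here is a minimal sketch of that, together with the test predictions, written against the tidyr 1.0.0+ interface (nest(data = ...) and unnest(c(...))). It is an illustration under stated assumptions rather than the answer's exact method: the split is deterministic (every third row goes to test) instead of random, only one model is fitted, and train_resid, test_preds and resid1 are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Deterministic split for illustration: every third row is held out as test
is_test <- seq_len(nrow(mtcars)) %% 3 == 0
mtcars_train <- mtcars[!is_test, ]
mtcars_test  <- mtcars[is_test, ]

# Fit one model per cyl group on the training data
dt_models <- mtcars_train %>%
  nest(data_train = -cyl) %>%
  mutate(model1 = map(data_train, ~ lm(disp ~ hp, data = .x)))

# Training residuals, one row per training observation
train_resid <- dt_models %>%
  mutate(resid1 = map(model1, ~ unname(residuals(.x)))) %>%
  select(cyl, data_train, resid1) %>%
  unnest(c(data_train, resid1))

# Test predictions, using the model fitted for the matching cyl group
test_preds <- mtcars_test %>%
  nest(data_test = -cyl) %>%
  inner_join(dt_models, by = "cyl") %>%
  mutate(pred1 = map2(model1, data_test,
                      ~ unname(predict(.x, newdata = .y)))) %>%
  select(cyl, data_test, pred1) %>%
  unnest(c(data_test, pred1))
```

The select() before each unnest() drops the remaining list-columns explicitly, since newer tidyr versions keep them instead of discarding them.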
