Reusing a Model Built in R

Question

When building a model in R, how do you save the model specifications such that you can reuse it on new data? Let's say I build a logistic regression on historical data but won't have new observations until next month. What's the best approach?

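Things I have considered:

  • Saving the model object and loading it in a new session
  • I know that some models can be exported with PMML, but I have not really seen anything about importing PMML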

Simply, I am trying to get a sense of what you do when you need to use your model in a new session.

Thanks in advance.

Recommended answer

Reusing a model to predict for new observations

If the model is not computationally costly, I tend to document the entire model building process in an R script that I rerun when needed. If a random element is involved in the model fitting, I make sure to set a known random seed.
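As a minimal sketch of the random-seed point, assuming a fitting function with a random component (kmeans() is used here purely as an illustration; the data, the helper name fit_clusters and the seed value are made up):

```r
## Hypothetical data: 50 points in two dimensions
set.seed(1)
pts <- matrix(rnorm(100), ncol = 2)

## kmeans() picks random starting centres, so fix the seed before each fit
fit_clusters <- function(seed) {
    set.seed(seed)
    kmeans(pts, centers = 3)
}

fit_a <- fit_clusters(2024)
fit_b <- fit_clusters(2024)

## same seed and same data give the same cluster assignments
same_fit <- identical(fit_a$cluster, fit_b$cluster)
```

Rerunning the script with the seed in place should therefore reproduce the earlier fit.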

If the model is computationally costly, then I still use a script as above, but save the model objects using save() into an rda file. I then tend to modify the script so that if the saved object exists it is loaded, and if not the model is refitted, using a simple if()...else clause wrapped around the relevant parts of the code.

When loading your saved model object, be sure to reload any required packages, although in your case, if the logit model was fit via glm(), there will not be any additional packages to load beyond those that come with R.

Here is an example:

> set.seed(345)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> ## save this model
> save(m1, file = "my_model1.rda")
> 
> ## a month later, new observations are available: 
> newdf <- data.frame(x = rnorm(20))
> ## load the model
> load("my_model1.rda")
> ## predict for the new `x`s in `newdf`
> predict(m1, newdata = newdf)
        1         2         3         4         5         6 
6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255 
        7         8         9        10        11        12 
6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537 
       13        14        15        16        17        18 
2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407 
       19        20 
5.4247548 2.6906722 

To automate this, I would probably do the following in a script:

## data
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

## check if a saved model exists; if not, (re)fit and save it:
if (file.exists("my_model1.rda")) {
    ## load the saved model
    load("my_model1.rda")
} else {
    ## (re)fit the model and save it for next time
    m1 <- lm(y ~ x, data = df)
    save(m1, file = "my_model1.rda")
}

## predict for new observations
## new observations
newdf <- data.frame(x = rnorm(20))
## predict
predict(m1, newdata = newdf)

Of course, the data generation code would be replaced by code loading your actual data.
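For the logistic regression mentioned in the question, the same save/load pattern might look like the sketch below. The simulated data, the object names (hist_df, logit_fit, new_df) and the file location under tempdir() are all assumptions made for illustration, not part of the original answer.

```r
## Simulated "historical" data (hypothetical): binary outcome y, one predictor x
set.seed(345)
hist_df <- data.frame(x = rnorm(100))
hist_df$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.5 * hist_df$x))

## fit the logistic regression and save it
logit_fit <- glm(y ~ x, family = binomial, data = hist_df)
model_file <- file.path(tempdir(), "logit_fit.rda")
save(logit_fit, file = model_file)

## simulate a new session: drop the object, then restore it from disk
rm(logit_fit)
load(model_file)

## predicted probabilities for the new observations
new_df <- data.frame(x = rnorm(10))
p <- predict(logit_fit, newdata = new_df, type = "response")
```

Using type = "response" returns probabilities rather than values on the log-odds scale, which is usually what is wanted when scoring new cases.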

If you want to refit the model using additional new observations, then update() is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to argument 'data', and then do the following:

m2 <- update(m1, . ~ ., data = df)

where m1 is the original, saved model fit; . ~ . describes the changes to the model formula, which in this case means keep all existing variables on both the left- and right-hand sides of ~ (in other words, make no changes to the model formula); and df is the data frame used to fit the original model, expanded to include the newly available observations.

Here is a worked example:

> set.seed(123)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> m1

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      4.960        2.222  

> 
> ## new observations
> newdf <- data.frame(x = rnorm(20))
> newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
> ## add on to df
> df <- rbind(df, newdf)
> 
> ## update model fit
> m2 <- update(m1, . ~ ., data = df)
> m2

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      4.928        2.187

Others have mentioned formula() in the comments, which extracts the formula from a fitted model:

> formula(m1)
y ~ x
> ## which can be used to set-up a new model call
> ## so an alternative to update() above is:
> m3 <- lm(formula(m1), data = df)

However, this becomes awkward if the model fitting involves additional arguments, such as 'family' or 'subset' in more complex model fitting functions, because those arguments have to be supplied again by hand. If an update() method is available for your model fitting function (which it is for many common fitting functions, like glm()), it provides a simpler way to update a model fit than extracting and reusing the model formula.
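A small sketch of the difference, using simulated data and made-up object names (old_obs, new_obs, all_obs): update() reuses the original call, including 'family', whereas refitting from formula() alone falls back to the default family of glm().

```r
## Simulated data (hypothetical): original and newly arrived observations
set.seed(1)
old_obs <- data.frame(x = rnorm(50))
old_obs$y <- rbinom(50, size = 1, prob = plogis(0.5 + 1.2 * old_obs$x))
new_obs <- data.frame(x = rnorm(50))
new_obs$y <- rbinom(50, size = 1, prob = plogis(0.5 + 1.2 * new_obs$x))
all_obs <- rbind(old_obs, new_obs)

## original logistic regression
m1 <- glm(y ~ x, family = binomial, data = old_obs)

## update() re-evaluates the original call with only 'data' changed
m2 <- update(m1, data = all_obs)

## reusing only the formula drops 'family', so glm() uses its default
m3 <- glm(formula(m1), data = all_obs)

family(m2)$family  # "binomial"
family(m3)$family  # "gaussian"
```

To get the same model via formula(), 'family = binomial' would have to be passed again explicitly.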

If you intend to do all the modelling and future prediction in R, there doesn't really seem much point in abstracting the model out via PMML or similar.
