Reusing a Model Built in R


Question

When building a model in R, how do you save the model specification so that you can reuse it on new data? Let's say I build a logistic regression on historical data but won't have new observations until next month. What's the best approach?

Things I have considered:

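  • Saving the model object and loading it in a new session
  • I know that some models can be exported with PMML, but I don't really know anything about importing PMML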

Simply put, I am trying to get a sense of what you do when you need to use your model in a new session.

Thanks.

Recommended answer

Reusing a model to predict for new observations

If the model is not computationally costly, I tend to document the entire model building process in an R script that I rerun when needed. If a random element is involved in the model fitting, I make sure to set a known random seed.

If the model is computationally costly to fit, then I still use a script as above, but save the model objects to an .rda file using save(). I then tend to modify the script so that if the saved object exists it is loaded, and if not the model is refit, using a simple if()...else clause wrapped around the relevant parts of the code.

When loading your saved model object, be sure to reload any required packages, although in your case, if the logit model was fit via glm(), there will not be any additional packages to load beyond R.

Here is an example:

> set.seed(345)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> ## save this model
> save(m1, file = "my_model1.rda")
> 
> ## a month later, new observations are available: 
> newdf <- data.frame(x = rnorm(20))
> ## load the model
> load("my_model1.rda")
> ## predict for the new `x`s in `newdf`
> predict(m1, newdata = newdf)
        1         2         3         4         5         6 
6.1370366 6.5631503 2.9808845 5.2464261 4.6651015 3.4475255 
        7         8         9        10        11        12 
6.7961764 5.3592901 3.3691800 9.2506653 4.7562096 3.9067537 
       13        14        15        16        17        18 
2.0423691 2.4764664 3.7308918 6.9999064 2.0081902 0.3256407 
       19        20 
5.4247548 2.6906722 
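The example above uses lm(). For the logistic regression mentioned in the question, the same save-and-load pattern might look like the following sketch. The simulated data, the object names (hist_df, logit_fit, new_df, model_file) and the use of a temporary file are assumptions made for illustration, not part of the original answer.

```r
## Simulated historical data (an assumption standing in for real data)
set.seed(42)
hist_df <- data.frame(x = rnorm(200))
hist_df$y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * hist_df$x))

## Fit the logistic regression with glm(); no extra packages are needed
logit_fit <- glm(y ~ x, family = binomial, data = hist_df)

## Save the fitted model; a temporary file is used here for illustration
model_file <- file.path(tempdir(), "logit_fit.rda")
save(logit_fit, file = model_file)

## Later (simulated here by removing the object): load the model again
rm(logit_fit)
load(model_file)

## Predict probabilities for new observations
new_df <- data.frame(x = c(-1, 0, 1))
p_hat <- predict(logit_fit, newdata = new_df, type = "response")
p_hat
```

Passing type = "response" returns predicted probabilities; without it, predict() on a glm fit returns values on the scale of the linear predictor (log-odds for a logit model).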

To automate this, I would probably do the following in a script:

## data
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

## check if model exists? If not, refit:
if(file.exists("my_model1.rda")) {
    ## load model
    load("my_model1.rda")
} else {
    ## (re)fit the model and save it so the next run can load it
    m1 <- lm(y ~ x, data = df)
    save(m1, file = "my_model1.rda")
}

## predict for new observations
## new observations
newdf <- data.frame(x = rnorm(20))
## predict
predict(m1, newdata = newdf)

Of course, the data generation code would be replaced by code loading your actual data.
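The load-or-refit logic can also be wrapped in a small function. The sketch below is one possible way to do that; load_or_fit() is a hypothetical helper, not something from the original answer, and it uses saveRDS()/readRDS() instead of save()/load() so that the model is returned as a value rather than restored under its saved name.

```r
## Hypothetical helper: return the cached model if the file exists,
## otherwise call fit_fun() to fit the model, cache it, and return it
load_or_fit <- function(path, fit_fun) {
    if (file.exists(path)) {
        readRDS(path)
    } else {
        fit <- fit_fun()
        saveRDS(fit, path)
        fit
    }
}

## Example data, generated as in the answer above
set.seed(1)
df <- data.frame(x = rnorm(20))
df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))

## Cache location (a temporary file, chosen for illustration)
cache <- file.path(tempdir(), "my_model1.rds")
unlink(cache)  # start from a clean state for this demonstration

## First call fits and saves; second call only loads,
## so its fit_fun is never evaluated
m_first  <- load_or_fit(cache, function() lm(y ~ x, data = df))
m_second <- load_or_fit(cache, function() stop("should not refit"))
```

Passing the fitting step as a function means the expensive fit only runs when no cached file is found.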

If you want to refit the model using additional new observations, then update() is a useful function. All it does is refit the model with one or more of the model arguments updated. If you want to include new observations in the data used to fit the model, add the new observations to the data frame passed to argument 'data', and then do the following:

m2 <- update(m1, . ~ ., data = df)

where m1 is the original, saved model fit; . ~ . specifies the change to the model formula, which in this case means keep all existing variables on both the left- and right-hand sides of ~ (in other words, make no changes to the model formula); and df is the data frame used to fit the original model, expanded to include the newly available observations.

Here is a worked example:

> set.seed(123)
> df <- data.frame(x = rnorm(20))
> df <- transform(df, y = 5 + (2.3 * x) + rnorm(20))
> ## model
> m1 <- lm(y ~ x, data = df)
> m1

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      4.960        2.222  

> 
> ## new observations
> newdf <- data.frame(x = rnorm(20))
> newdf <- transform(newdf, y = 5 + (2.3 * x) + rnorm(20))
> ## add on to df
> df <- rbind(df, newdf)
> 
> ## update model fit
> m2 <- update(m1, . ~ ., data = df)
> m2

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      4.928        2.187

Others have mentioned formula() in the comments, which extracts the formula from a fitted model:

> formula(m1)
y ~ x
> ## which can be used to set-up a new model call
> ## so an alternative to update() above is:
> m3 <- lm(formula(m1), data = df)

However, reusing only the formula is not enough if the model fit involves additional arguments, such as 'family' or 'subset' in more complex model fitting functions; those would have to be restated by hand. If update() methods are available for your model fitting function (which they are for many common fitting functions, like glm()), update() provides a simpler way to update a model fit than extracting and reusing the model formula.
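To illustrate the difference for a glm() fit, here is a minimal sketch with simulated data. The object names (g1, g2, g3, df_all) and the simulated coefficients are assumptions made for this example.

```r
## Simulated binary data (an assumption for illustration)
set.seed(7)
df <- data.frame(x = rnorm(100))
df$y <- rbinom(100, size = 1, prob = plogis(0.3 + 0.8 * df$x))

## Original logistic regression fit
g1 <- glm(y ~ x, family = binomial, data = df)

## New observations, appended to the original data
newdf <- data.frame(x = rnorm(50))
newdf$y <- rbinom(50, size = 1, prob = plogis(0.3 + 0.8 * newdf$x))
df_all <- rbind(df, newdf)

## update() re-evaluates the original call with 'data' replaced,
## so family = binomial is carried over automatically
g2 <- update(g1, data = df_all)
family(g2)$family   # "binomial"
nobs(g2)            # 150

## Reusing only the formula means restating 'family' by hand
g3 <- glm(formula(g1), family = binomial, data = df_all)
all.equal(coef(g2), coef(g3))   # TRUE
```

If family = binomial were left out of the g3 call, glm() would fall back to its default gaussian family and fit a different model.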

If you intend to do all the modelling and future prediction in R, there doesn't really seem to be much point in abstracting the model out via PMML or similar.
