Why is caret train taking up so much memory?

Question

When I train just using glm, everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this...any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search because there's no grid for glm's, I believe.)

Answer

The problem is twofold. i) train() doesn't just fit a model via glm(), it will bootstrap that model, so even with the defaults train() will do 25 bootstrap samples, which, coupled with problem ii), is the (or a) source of your problem; and ii) train() simply calls the glm() function with its defaults. Those defaults are to store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame style. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also has a copy of the actual data.
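You can confirm the glm() default mentioned above directly from its formals; a quick check (not part of the original answer):

formals(glm)$model   ## TRUE - glm() keeps the model frame unless told otherwise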

At this point, simply running glm() using train() will be producing 25 copies of the fully expanded model.frame and the original data, which will all need to be held in memory during the resampling process - whether these are held concurrently or consecutively is not immediately clear from a quick look at the code as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.
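Since the resampling is what multiplies those copies, one lever worth knowing about (not covered in the original answer, so verify the exact arguments against ?trainControl for your version of caret) is to ask for fewer resamples, or none at all, via trainControl(). A minimal sketch:

library(caret)

## Fewer bootstrap resamples than the default of 25:
tc_small <- trainControl(method = "boot", number = 5)

## Newer versions of caret also accept method = "none", which skips
## resampling entirely and only fits the final model:
tc_none <- trainControl(method = "none")

Either object is then passed to train() as trControl = tc_small (or tc_none), trading resampling-based performance estimates for a smaller memory and time footprint.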

Once the resampling is finished, the returned object will contain 2 copies of the raw data and a full copy of the model.frame. If your training data is large relative to available RAM or contains many factors to be expanded in the model.frame, then you could easily be using huge amounts of memory just carrying copies of the data around.

If you add model = FALSE to your train call, that might make a difference. Here is a small example using the clotting data in ?glm:

clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)

then

> m1 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm", 
+             model = TRUE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = FALSE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data=clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data=clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes

So there is a size difference in the returned object and memory use during training will be lower. How much lower will depend on whether the internals of train() keep all copies of the model.frame in memory during the resampling process.
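To see where those copies live, you can inspect the components of the fitted objects directly. A minimal sketch continuing from m1 above; the component names ($trainingData on the "train" object, $model and $data on the underlying "glm" fit) are the usual ones, but check str(m1, max.level = 1) if in doubt:

object.size(m1$trainingData)       ## copy of the training data stored by train()
object.size(m1$finalModel$model)   ## model frame kept because model = TRUE
object.size(m1$finalModel$data)    ## data kept by the glm() fit itself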

The object returned by train() is also significantly larger than that returned by glm() - as mentioned by @DWin in the comments, below.
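If the size of the returned object itself is the main concern, trainControl() also has a returnData argument that stops train() from stashing the training set in $trainingData. A hedged sketch continuing the clotting example (confirm the argument in ?trainControl for your installed version):

## Drop the $trainingData copy as well as the glm model frame:
tc <- trainControl(returnData = FALSE)
m5 <- train(lot1 ~ log(u), data = clotting, family = Gamma, method = "glm",
            model = FALSE, trControl = tc)
object.size(m5)   ## should come in smaller than m2 above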

To take this further, either study the code more closely, or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.
