saveRDS inflating size of object


Problem description

This is a tricky one, as I can't provide a reproducible example, but I'm hoping that others have experience dealing with this.

Essentially, I have a function that pulls a large quantity of data from a DB, cleans it and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10 MB.

It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.

For example:

library(plyr)  # provides llply()

# Schematic only: db.connection.object and some.parameters are placeholders.
# get_db_data(), build_models(), clean_data() and lm_model() are defined
# externally (as part of a package).

db.connection <- db.connection.object

build_model_list <- function(db.connection) {

  clean_and_build_models <- function(db.connection, other.parameters) {

    # Retrieve db data
    db.data <- get_db_data(db.connection, some.parameters)

    # Clean and filter data based on parameters
    clean.data <- clean_data(db.data, some.parameters)

    # Build lm model based on clean.data
    lm.model <- lm_model(clean.data)

    return(list(lm.model, other.parameters))
  }

  looped.model.object <- llply(
    some.parameters,
    function(p) clean_and_build_models(db.connection, p)
  )

  return(looped.model.object)
}

model.list <- build_model_list(db.connection)

saveRDS(model.list, "~/a_place/model_list.RDS")

The issue I'm getting is that the 'model.list' object, which is only 10 MB in memory, inflates to many GB when I save it locally as an RDS file or try to upload it to AWS S3.

I should note that although the function processes very large quantities of data (~5 million rows), the data used in the outputs is no larger than a few hundred rows.

Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (which are part of a package), e.g. clean_data and lm_model, inside the main function helps reduce the RDS save size.

This, however, has some big disadvantages.

Firstly, it's trial and error and follows no clear logical order. With frequent crashes and a couple of hours needed to build the list object, it's a very long debugging cycle.

Secondly, it means my main function will be many hundreds of lines long, which will make future alterations and debugging much trickier.

My questions are:

Has anyone encountered this issue before?

Any hypotheses as to what's causing it?

Has anyone found a logical, non-trial-and-error solution to this?

Thanks for your help.

Recommended answer

It took a bit of digging, but I did find a solution in the end.

It turns out the lm model objects were the guilty party, based on this very helpful article:

https://blogs.oracle.com/R/entry/is_the_size_of_your

The lm.object$terms component carries an environment attribute (".Environment") that references the environment in which the model was built and, through it, the objects that were present there at the time. Under certain circumstances, saveRDS will pull the objects in that environment into the saved file along with the model.
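The mechanism can be reproduced on a small scale. The following is a minimal sketch, not the original poster's code: build_model() and the variable names are hypothetical, and the model is fitted inside a function whose frame also holds a large, unrelated vector. object.size() does not follow environment references, while serialize() (which saveRDS uses) does, so the two sizes can differ widely.

```r
# Hypothetical helper: fits a tiny model inside a function whose frame
# also contains a large object that the model never uses.
build_model <- function() {
  big <- rnorm(1e6)                               # roughly 8 MB of unrelated data
  d <- data.frame(x = 1:100, y = rnorm(100))
  lm(y ~ x, data = d)                             # formula captures this function's frame
}

m <- build_model()

# Size reported in memory (environments are not counted)
in_memory_size <- as.numeric(object.size(m))

# Size of what saveRDS would actually write, before compression
serialized_size <- length(serialize(m, NULL))

# Objects reachable through the terms environment
captured <- ls(envir = attr(m$terms, ".Environment"))

print(in_memory_size)
print(serialized_size)
print(captured)
```

In this sketch the captured environment is the function's frame rather than the global environment, which may be closer to what happens when models are built inside nested functions.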

As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, the RDS file inflated dramatically: saveRDS was effectively trying to serialize and compress ~100 GB of data (~200 models × ~0.5 GB each).

To test whether this is what's causing the problem, run the following code on one of the lm objects:

# Serialized size, in bytes, of each component of a single lm object
as.matrix(lapply(lm.object, function(x) length(serialize(x, NULL))))

This prints the serialized size of each component, which tells you whether the $terms component is the one inflating.
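When there are many models, the same check can be run across the whole list at once. This is a rough sketch under assumptions: make_model() and the list names are hypothetical stand-ins for the real model list, and each element is assumed to be a plain lm object.

```r
# Hypothetical stand-in for the real models: n_junk controls how much
# unrelated data sits in the environment the formula captures.
make_model <- function(n_junk) {
  junk <- numeric(n_junk)
  d <- data.frame(x = 1:50, y = (1:50) * 2 + rnorm(50))
  lm(y ~ x, data = d)
}

models <- list(small = make_model(10), large = make_model(1e6))

# One column per model, one row per lm component, values in bytes
component_sizes <- sapply(models, function(m) {
  sapply(m, function(x) length(serialize(x, NULL)))
})

print(component_sizes)
```

A "terms" row (and usually a "model" row, since the model frame carries its own copy of the terms) that dwarfs the others points at a captured environment.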

The following code empties the environment that the $terms component references:

terms.env <- attr(lm.object$terms, ".Environment")
rm(list = ls(envir = terms.env), envir = terms.env)

Be warned, though: this deletes every object in the referenced environment. If the model was built at the top level, that is your global environment.
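A less destructive alternative, which is not the method described in this answer, is to repoint the environment attribute instead of emptying the environment. The sketch below assumes plain lm objects; strip_lm_env() and build_model() are hypothetical names. It sets the attribute to globalenv(), which R serializes as a reference rather than by content, and it does so both on $terms and on the copy of the terms attached to the model frame in $model.

```r
# Hypothetical helper: detach an lm object from the environment it was built in.
strip_lm_env <- function(model) {
  attr(model$terms, ".Environment") <- globalenv()
  # The model frame keeps its own "terms" attribute with the same environment
  if (!is.null(model$model)) {
    attr(attr(model$model, "terms"), ".Environment") <- globalenv()
  }
  model
}

# Hypothetical model built next to a large, unrelated object
build_model <- function() {
  big <- rnorm(1e6)
  d <- data.frame(x = 1:100, y = rnorm(100))
  lm(y ~ x, data = d)
}

m <- build_model()
slim <- strip_lm_env(m)

size_before <- length(serialize(m, NULL))
size_after <- length(serialize(slim, NULL))

# Predictions should be unchanged for a simple formula like this one
new_data <- data.frame(x = c(101, 102))
pred_before <- predict(m, newdata = new_data)
pred_after <- predict(slim, newdata = new_data)

print(c(before = size_before, after = size_after))
```

One caveat: if the formula refers to variables or functions that exist only in the original environment, predict() on the stripped model may no longer find them, so it is worth checking predictions before and after.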
