saveRDS inflating size of object


Problem description


This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.

Essentially I have a function that pulls a large quantity of data from a DB, cleans it and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. These are compiled into a complex list structure that totals about 10 MB.

It's then supposed to be saved as an RDS file on AWS S3, from where it's retrieved in a production environment to build predictions.

e.g.

library(plyr)  # for llply()

db.connection <- db.connection.object

build_model_list <- function(db.connection) {

  clean_and_build_models <- function(db.connection, other.parameters) {

    ## Externally defined (part of a package)
    get_db_data <- function(db.connection, some.parameters) {
      # Retrieve db data
    }

    db.data <- get_db_data(db.connection, other.parameters)

    ## Externally defined
    build_models <- function(db.data, some.parameters) {
      # Builds models from the cleaned data
    }

    ## Externally defined
    clean_data <- function(db.data, some.parameters) {
      # Cleans and filters data based on parameters
    }

    clean.data <- clean_data(db.data, other.parameters)

    ## Externally defined
    lm_model <- function(clean.data) {
      # Builds lm model based on clean.data
    }

    lm.model <- lm_model(clean.data)

    return(list(lm.model, other.parameters))
  }

  looped.model.object <- llply(some.parameters,
                               function(p) clean_and_build_models(db.connection, p))

  return(looped.model.object)
}

model.list <- build_model_list(db.connection)

saveRDS(model.list, "~/a_place/model_list.RDS")

The issue I'm getting is that the 'model.list' object, which is only 10 MB in memory, inflates to many GB when I save it locally as RDS or try to upload it to AWS S3.

I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.
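For reference, the discrepancy can be seen without even writing the file, by comparing the in-memory size with the serialised size (the latter is roughly what saveRDS writes before compression; this is just a base-R check, not part of the pipeline above):

format(object.size(model.list), units = "MB")    # in-memory footprint (~10 MB here)
length(serialize(model.list, NULL)) / 1024^2     # serialised size in MB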

Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.

This however has some big disadvantages.

Firstly, it's trial and error with no clear logical order; with frequent crashes and a couple of hours needed to build the list object, it makes for a very long debugging cycle.

Secondly, it'll mean my main function will be many hundreds of lines long, which will make future alterations and debugging much trickier.

My questions to you are:

Has anyone encountered this issue before?

Any hypotheses as to what's causing it?

Has anyone found a logical non-trial-and-error solution to this?

Thanks for your help.

Solution

It took a bit of digging but I did actually find a solution in the end.

It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:

https://blogs.oracle.com/R/entry/is_the_size_of_your

It turns out that the lm.object$terms component includes an environment that references the objects present in the global environment when the model was built. Under certain circumstances, when you call saveRDS, R will try to pull those environment objects into the saved object.
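You can see the captured environment directly on any lm fit; a tiny illustration (using a throwaway model, not the ones from the question):

fit <- lm(mpg ~ wt, data = mtcars)
attr(fit$terms, ".Environment")        # typically <environment: R_GlobalEnv>
ls(attr(fit$terms, ".Environment"))    # everything saveRDS may end up dragging in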

As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically, as it was actually trying to compress ~100 GB of data.

To test whether this is what's causing the problem, execute the following code:

as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL)))) 

This will tell you if the $terms component is inflating.
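To run the same check across the whole list (a sketch that assumes each list entry stores its lm object as the first element, as in the outline in the question):

# Serialised size, in bytes, of each model's $terms component across the list
sapply(model.list, function(entry) length(serialize(entry[[1]]$terms, NULL)))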

The following code will clear out the environment referenced by the $terms component:

rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment")) 

Be warned, though, that it will also remove all of the global environment objects it references.
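If deleting your global objects is too heavy-handed, a gentler variant of the same idea (a sketch not taken from the article above, so treat it as an assumption and verify that predict() still works on the stripped models) is to repoint the captured environment at a small, nearly empty one before saving, so there is nothing large left for saveRDS to drag in:

## Sketch: swap the environment captured by each model's terms for a nearly
## empty one before saving. new.env(parent = baseenv()) keeps base functions
## visible in case the formula uses e.g. log(); entry[[1]] assumes the lm
## object is the first element of each list entry, as in the question's outline.
strip_terms_env <- function(m) {
  small.env <- new.env(parent = baseenv())
  attr(m$terms, ".Environment") <- small.env
  if (!is.null(m$model)) {
    attr(attr(m$model, "terms"), ".Environment") <- small.env
  }
  m
}

model.list.small <- lapply(model.list, function(entry) {
  entry[[1]] <- strip_terms_env(entry[[1]])
  entry
})

saveRDS(model.list.small, "~/a_place/model_list.RDS")

If predict() later fails to find a function used in a formula, re-attach a richer environment after loading (e.g. attr(m$terms, ".Environment") <- globalenv()) rather than shipping the whole environment inside the file.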
