R H2O - Memory management

Question

I'm trying to use H2O via R to build multiple models using subsets of one large-ish data set (~10 GB). The data covers one year, and I'm trying to build 51 models (i.e. train on week 1, predict on week 2, etc.), with each week being about 1.5-2.5 million rows with 8 variables.

I've done this inside a loop, which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I created a function to remove all of them except the main data set.

h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()){
  # Find all objects on server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop thru and remove items to be removed
  for(i in keysToKill){
    h2o.rm(object = clust, keys = i)
    if(verbose){
      print(i); flush.console()
    }
  }
  # Print remaining objects in cluster.
  h2o.ls(clust)
}

The script runs fine for a while and then crashes - often with a complaint about running out of memory and swapping to disk.

Here's some pseudocode to describe the process:

# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")

# Start loop
for(i in 1:51){
  # create test/train hex objects
  train1.hex <- dat1.hex[dat1.hex$week_num == i,]
  test1.hex <- dat1.hex[dat1.hex$week_num == i + 1,]
  # train gbm
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
                      , nfolds = 3
                      , importance = TRUE
                      , distribution = 'bernoulli'
                      , n.trees = 100
                      , interaction.depth = 10
                      , shrinkage = 0.01
  )
  # calculate out of sample performance
  test2.hex <- cbind.H2OParsedData(test1.hex, h2o.predict(dat1.gbm, test1.hex))
  colnames(test2.hex) <- names(head(test2.hex))
  gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)@model$auc

  # clean h2o entity
  h2o.clean(clust = localH2O, verbose = FALSE, vte = c('dat1.hex'))

} # end loop

My question is: what, if any, is the correct way to manage data and memory in a standalone entity for this type of process? (This is NOT running on Hadoop or a cluster, just a large EC2 instance with ~64 GB RAM and 12 CPUs.) Should I be killing and recreating the H2O entity after each loop? That was my original process, but reading the data from file every time adds ~10 minutes per iteration. Is there a proper way to garbage collect or release memory after each loop?

Any suggestions would be appreciated.

Recommended answer

This answer is for the original H2O project (releases 2.x.y.z).

In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.

These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.

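What I recommend doing is:


  • At the bottom of each loop iteration, use h2o.assign() to make a deep copy of anything you want to save to a known key name

  • Use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps

  • Call gc() explicitly somewhere in the loop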

Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:

removeLastValues <- function(conn) {
    df <- h2o.ls(conn)
    keys_to_remove <- grep("^Last\\.value\\.", perl=TRUE, x=df$Key, value=TRUE)
    unique_keys_to_remove <- unique(keys_to_remove)
    if (length(unique_keys_to_remove) > 0) {
        h2o.rm(conn, unique_keys_to_remove)
    }
}
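
To see which keys that pattern would pick up without touching a live cluster, the filter can be tried on a plain character vector. This is only a sketch: `mock_listing` and `select_temp_keys` are hypothetical names, and the key names are made up to stand in for whatever h2o.ls() would return.

```r
# Hypothetical stand-in for the data frame returned by h2o.ls(conn);
# the key names below are invented for illustration only.
mock_listing <- data.frame(
  Key = c("dat1.hex", "Last.value.0", "Last.value.17", "Last.value.0",
          "GBM_model_1", "my.Last.value.copy"),
  stringsAsFactors = FALSE
)

# Same regular expression as in removeLastValues(): match only keys that
# start with "Last.value." and drop duplicate matches.
select_temp_keys <- function(keys) {
  unique(grep("^Last\\.value\\.", keys, perl = TRUE, value = TRUE))
}

temp_keys <- select_temp_keys(mock_listing$Key)   # keys that would be removed
kept_keys <- setdiff(mock_listing$Key, temp_keys) # keys that would survive

print(temp_keys)
print(kept_keys)
```

Because the pattern is anchored with `^`, a key such as "my.Last.value.copy" is left alone, so objects saved under your own key names are not caught by the cleanup.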

Here is a link to an R test in the H2O github repository that uses this technique and can run indefinitely without running out of memory:

https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R
