Improve h2o DRF runtime on a multi-node cluster

Question

I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans across all 3 nodes). My data set has 1M rows and 41 columns (40 predictors and 1 response).
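
For reference, attaching the R client to the existing multi-node cloud looks roughly like this (the IP address and port below are placeholders for the actual EC2 addresses, not part of the original question):

library(h2o)

# Connect the R session to the already-running 3-node cloud instead of
# starting a local instance. Replace the placeholder IP/port as needed.
h2o.init(ip = "10.0.0.1", port = 54321, startH2O = FALSE)

# Confirm that all 3 nodes joined the same cloud before training.
h2o.clusterInfo()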

I use the R bindings to control the cluster, and the RF call is as follows:

model=h2o.randomForest(x=x,
                       y=y,
                       ignore_const_cols=TRUE,
                       training_frame=train_data,
                       seed=1234,
                       mtries=7,
                       ntrees=2000,
                       max_depth=15,
                       min_rows=50,
                       stopping_rounds=3,
                       stopping_metric="MSE",
                       stopping_tolerance=2e-5)

For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240 seconds; CPU utilization is between 10-20%, RAM utilization is between 20-30%, and network transfer is between 10-50 MByte/sec (in and out). 300 trees are built before early stopping kicks in.
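
The 240-second figure is simple wall-clock time around the training call; a minimal sketch (using the same x, y, and train_data as above) looks like this:

# Wall-clock timing of the training call shown above.
elapsed <- system.time(
  model <- h2o.randomForest(x = x, y = y,
                            training_frame = train_data,
                            ignore_const_cols = TRUE,
                            seed = 1234, mtries = 7,
                            ntrees = 2000, max_depth = 15, min_rows = 50,
                            stopping_rounds = 3,
                            stopping_metric = "MSE",
                            stopping_tolerance = 2e-5)
)["elapsed"]
print(elapsed)

# H2O's own per-node view of memory and load, as a rough cross-check
# against the EC2-level CPU/RAM/network figures quoted above.
h2o.clusterStatus()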

On a single-node cluster, I can get the same results in about 80sec. So, instead of an expected 3-fold speed up, I get a 3-fold slow down for the 3-node cluster.

I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance: https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8

Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ points out:

While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).

Also https://www.slideshare.net/0xdata/rf-brighttalk points at 2 different DRF implementations, where one has a larger network overhead.

I think that I am running into the same problems as described in the links above. How can I improve h2o's DRF performance on a multi-node cluster? Are there any settings that might improve runtime? Any help highly appreciated!

Answer

If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.

Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the type of slowdown that you observed.
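
As a sketch, running on a single node just means pointing the R client at a one-node H2O instance; the memory size below is a placeholder and should be set to what the machine actually has:

library(h2o)

# Start (or connect to) a local, single-node H2O instance that uses all
# available cores. max_mem_size is a placeholder; size it to your RAM.
h2o.init(nthreads = -1, max_mem_size = "60G")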

If your data fits into memory on a single machine (and you can successfully train a model without running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also experiment with certain parameter values that affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
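
For illustration only, the kind of parameters that typically trade model performance for speed in h2o.randomForest look like this (the specific values are assumptions, not recommendations from the answer):

# Illustrative speed-oriented settings: fewer/shallower trees, coarser
# histograms, and row subsampling per tree. Values are placeholders.
fast_model <- h2o.randomForest(x = x,
                               y = y,
                               training_frame = train_data,
                               seed = 1234,
                               ntrees = 500,        # fewer trees
                               max_depth = 10,      # shallower trees
                               min_rows = 100,      # larger leaves
                               nbins = 16,          # coarser split histograms
                               sample_rate = 0.5,   # subsample rows per tree
                               stopping_rounds = 3,
                               stopping_metric = "MSE",
                               stopping_tolerance = 2e-5)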
