AWS SageMaker custom user algorithms: how to take advantage of extra instances


Question


This is a fundamental AWS SageMaker question. When I run training with one of SageMaker's built-in algorithms, I can take advantage of a massive speedup by distributing the job across many instances via the instance_count argument of the training algorithm. However, when I package my own custom algorithm, increasing the instance count seems to just duplicate the training on every instance, leading to no speedup.


I suspect that when I package my own algorithm there is something special I need to do, inside my custom train() function, to control how the training is handled on a particular instance (otherwise, how would it know how the job should be distributed?), but I have not been able to find any discussion of how to do this online.


Does anyone know how to handle this? Thank you very much in advance.


Specific examples: => It works well with a standard algorithm: I verified that increasing train_instance_count in the first documented SageMaker example speeds things up: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html


=> It does not work in my custom algorithm. I took the standard scikit-learn build-your-own-model example, added a few extra sklearn variants inside the training, and printed out the results to compare. When I increase the train_instance_count passed to the Estimator object, it runs the same training on every instance, so the output is duplicated across instances (the printouts of the results are identical) and there is no speedup. This is the sklearn example I started from: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb . The third argument of the Estimator object partway down in that notebook is what controls the number of training instances.
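For context on why every instance runs the same training: inside a custom training container, SageMaker describes the cluster layout in /opt/ml/input/config/resourceconfig.json (fields `current_host` and `hosts`), and it is the container's job to use that information to divide the work. The sketch below is a minimal, hypothetical helper (not part of any SageMaker SDK) showing how a custom train() function could shard its input files deterministically so each instance processes a different subset:

```python
import json
import os

# Path where SageMaker mounts the cluster layout inside a training container.
RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"


def cluster_layout(path=RESOURCE_CONFIG):
    """Return (current_host, sorted list of all hosts) for this training job.

    Falls back to a single-host layout when the file is absent,
    e.g. when testing outside SageMaker.
    """
    if os.path.exists(path):
        with open(path) as f:
            cfg = json.load(f)
        return cfg["current_host"], sorted(cfg["hosts"])
    return "algo-1", ["algo-1"]


def shard_for_host(items, hosts, current_host):
    """Deterministically assign each item to exactly one host (round-robin).

    Every host computes the same assignment, so the shards are disjoint
    and together cover all items with no coordination needed.
    """
    rank = hosts.index(current_host)
    return [item for i, item in enumerate(items) if i % len(hosts) == rank]
```

A custom train() could call cluster_layout() once at startup and only train on shard_for_host(all_files, hosts, current_host); how the per-instance results are then combined is algorithm-specific and left out here.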

Answer


Distributed training requires a way to sync the results of the training between the training workers. Most traditional libraries, such as scikit-learn, are designed to work with a single worker and can't simply be used in a distributed environment. Amazon SageMaker distributes the data across the workers, but it is up to you to make sure that the algorithm can benefit from multiple workers. Some algorithms, such as Random Forest, can take advantage of the distribution more easily, since each worker can build a different part of the forest, but other algorithms need more help.
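The Random Forest point can be made concrete with a small, purely illustrative sketch (this is not how SageMaker's built-in algorithms are implemented): give each worker its own slice of the forest's trees, train the slices independently, and merge the ensembles afterwards.

```python
def estimators_for_worker(n_estimators, n_workers, rank):
    """Split a forest of n_estimators trees across n_workers.

    Each worker grows its own slice of the forest on its data shard;
    the per-worker forests can later be concatenated into one ensemble.
    The first `extra` workers grow one extra tree each so the total
    always adds up to n_estimators.
    """
    base, extra = divmod(n_estimators, n_workers)
    return base + (1 if rank < extra else 0)
```

With scikit-learn, for example, each worker could run a RandomForestClassifier with n_estimators=estimators_for_worker(...) and the trained estimators could be pooled; the merging bookkeeping (aligning classes_, updating n_estimators) is omitted here.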


Spark MLLib has distributed implementations of popular algorithms such as k-means, logistic regression, or PCA, but these implementations are not good enough for some cases. Most of them were too slow, and some even crashed when a lot of data was used for training. The Amazon SageMaker team reimplemented many of these algorithms from scratch to benefit from the scale and economics of the cloud (20 hours of one instance cost the same as 1 hour of 20 instances, which is just 20 times faster). Many of these algorithms are now more stable and much faster, with near-linear scalability. See more details here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html


For the deep learning frameworks (TensorFlow and MXNet), SageMaker uses the built-in parameter server that each framework provides, but it does the heavy lifting of building the cluster and configuring the instances to communicate with it.
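For the framework estimators this is mostly configuration. As a rough sketch (the exact parameter names are taken from the SageMaker Python SDK docs at the time of writing and may differ between SDK versions), the TensorFlow estimator accepts a `distributions` argument that enables the framework's parameter server across the training instances:

```python
# Configuration sketch: enable TensorFlow's parameter server across
# the training instances via the SageMaker Python SDK's `distributions`
# argument (SDK v1 naming; assumed, check your SDK version).
distributions = {"parameter_server": {"enabled": True}}

# It would be passed alongside the instance count, roughly:
#
# estimator = TensorFlow(
#     entry_point="train.py",          # hypothetical training script
#     train_instance_count=2,          # SageMaker wires the cluster up
#     train_instance_type="ml.p3.2xlarge",
#     distributions=distributions,
# )
```

The point is that the user-visible knob stays small: the SDK builds the cluster and points every instance at the parameter server for you.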

