AWS Sagemaker custom user algorithms: how to take advantage of extra instances

Problem description

This is a fundamental AWS Sagemaker question. When I run training with one of Sagemaker's built-in algorithms, I can take advantage of a massive speedup by distributing the job across many instances, simply by increasing the instance_count argument of the training algorithm. However, when I package my own custom algorithm, increasing the instance count seems to just duplicate the training on every instance, leading to no speedup.

I suspect that when I package my own algorithm there is something special I need to do inside my custom train() function to control how the training differs per instance (otherwise, how would it know how the job should be distributed?), but I have not been able to find any discussion of how to do this online.
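For context on what a custom train() function can see: SageMaker's documented training container contract writes the cluster layout into /opt/ml/input/config/resourceconfig.json on every instance, including the current host's name and the sorted list of all hosts. A sketch of using it to split work (the my_shard helper and the fallback host name are illustrative, not part of any SageMaker API):

```python
import json
import os

def get_cluster_info(config_path="/opt/ml/input/config/resourceconfig.json"):
    """Read the cluster layout SageMaker writes into each training container.

    The file contains (at least) "current_host" and the list "hosts".
    Falls back to a single-host layout when the file is absent, e.g. when
    running outside a SageMaker training job.
    """
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
        return config["current_host"], sorted(config["hosts"])
    return "algo-1", ["algo-1"]

def my_shard(items, current_host, hosts):
    """Give every len(hosts)-th item to this host so no two hosts repeat work."""
    rank = hosts.index(current_host)
    return items[rank::len(hosts)]
```

Inside train(), each instance could then call my_shard(tasks, *reversed(get_cluster_info())) — or simply branch on its rank — instead of running the identical job everywhere.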

Does anyone know how to handle this? Thank you very much in advance.

Specific examples: => It works well with a standard algorithm: I verified that increasing train_instance_count in the first documented SageMaker example speeds things up: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html

=> It does not work with my custom algorithm. I took the standard sklearn build-your-own-model example, added a few extra sklearn variants inside the training, and printed out the results to compare. When I increase the train_instance_count that is passed to the Estimator object, it runs the same training on every instance, so the output is duplicated on each instance (the printouts of the results are duplicated) and there is no speedup. This is the sklearn example base: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb. The third argument of the Estimator object partway down this notebook is what lets you control the number of training instances.
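One likely reason every instance trains on identical data: SageMaker's default S3 data distribution is FullyReplicated, which copies the full dataset to every instance; ShardedByS3Key instead hands each instance a disjoint subset of the S3 objects. A sketch of the input channel for boto3's create_training_job (the training_channel helper and the S3 URI are illustrative; S3DataDistributionType and its two values are the real API fields):

```python
def training_channel(s3_uri, distribution="ShardedByS3Key"):
    """Build one InputDataConfig channel dict for create_training_job.

    With the default "FullyReplicated" every instance downloads the whole
    dataset; "ShardedByS3Key" splits the S3 objects across instances so
    each one trains on a different subset.
    """
    return {
        "ChannelName": "training",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,  # placeholder bucket/prefix
                "S3DataDistributionType": distribution,
            }
        },
    }
```

Note that sharding the data only helps if the train() code then combines the per-instance results; on its own it just gives each instance less data.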

Recommended answer

Distributed training requires a way to sync the training results between the training workers. Most traditional libraries, such as scikit-learn, are designed to work with a single worker and cannot simply be used in a distributed environment. Amazon SageMaker distributes the data across the workers, but it is up to you to make sure the algorithm can benefit from multiple workers. Some algorithms, such as random forests, are easier to distribute, since each worker can build a different part of the forest, but other algorithms need more help.
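The random-forest case above can be sketched concretely: each worker trains a forest on its own data shard, and the sub-forests are then concatenated into one ensemble. This is a local sketch of the merge step only; in a real job each worker would save its sub-forest to the output location and one host would do the merge. The merge_forests helper is hypothetical, but extending a fitted RandomForestClassifier's estimators_ list is a known scikit-learn pattern.

```python
from sklearn.ensemble import RandomForestClassifier

def merge_forests(forests):
    """Combine independently trained random forests into one ensemble.

    Assumes all forests were trained on data with the same feature columns
    and the same set of class labels, so their trees are interchangeable.
    """
    merged = forests[0]
    for f in forests[1:]:
        merged.estimators_ += f.estimators_  # concatenate the trees
    merged.n_estimators = len(merged.estimators_)  # keep the count consistent
    return merged
```

After the merge, predict() averages over all of the combined trees, just as if a single larger forest had been trained.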

Spark MLlib has distributed implementations of popular algorithms such as k-means, logistic regression, and PCA, but these implementations are not good enough for some cases. Most of them are too slow, and some even crash when a lot of data is used for the training. The Amazon SageMaker team reimplemented many of these algorithms from scratch to benefit from the scale and economics of the cloud (20 hours on one instance costs the same as 1 hour on 20 instances, just 20 times faster). Many of these algorithms are now more stable and much faster, with linear or better scalability. See more details here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

For the deep learning frameworks (TensorFlow and MXNet), SageMaker uses the parameter server built into each framework, but it does the heavy lifting of building the cluster and configuring the instances to communicate with it.

