Is there some kind of persistent local storage in AWS SageMaker model training?


Problem Description

I did some experimentation with AWS SageMaker, and the download time of large datasets from S3 is very problematic, especially while the model is still in development and you want some kind of initial feedback relatively fast.

Is there some kind of local storage or other way to speed things up?

EDIT: I refer to the batch training service, which allows you to submit a job as a Docker container.

While this service is intended for already-validated jobs that typically run for a long time (which makes the download time less significant), there's still a need for quick feedback:

  1. There's no other way to do the "integration" testing of your job with the SageMaker infrastructure (configuration files, data files, etc.)

  2. When experimenting with different variations of the model, it's important to be able to get initial feedback relatively fast.

Recommended Answer

SageMaker has a few distinct services in it, and each is optimized for a specific use case. If you are talking about the development environment, you are probably using the notebook service. The notebook instance comes with a local EBS volume (5GB) that you can use to copy some data into and run fast development iterations without copying the data from S3 every time. The way to do it is by running wget or aws s3 cp from the notebook cells, or from the terminal that you can open from the directory list page.
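For example, here is a minimal boto3 sketch of that copy step (the bucket and key names are hypothetical); on a notebook instance, files under /home/ec2-user/SageMaker live on the attached EBS volume, so they persist across restarts:

import os
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; adjust to your own dataset.
local_dir = "/home/ec2-user/SageMaker/data"  # backed by the notebook's EBS volume
os.makedirs(local_dir, exist_ok=True)
s3.download_file("my-bucket", "datasets/sample/train.csv",
                 os.path.join(local_dir, "train.csv"))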

Nevertheless, it is not recommended to copy too much data into the notebook instance, as it will cause your training and experiments to take too long. Instead, you should utilize the second part of SageMaker, which is the training service. Once you have a good sense of the model that you want to train, based on quick iterations over small datasets on the notebook instance, you can point your model definition at larger datasets that are processed in parallel across a cluster of training instances. When you send a training job, you can also define how much local storage each training instance will use, but you will benefit most from the distributed mode of the training.

When you want to optimize your training job, you have a few options for the storage. First, you can define the size of the EBS volume that you want your model to train on, for each one of the cluster instances. You can specify it when you launch the training job (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html):

...
   "ResourceConfig": { 
      "InstanceCount": number,
      "InstanceType": "string",
      "VolumeKmsKeyId": "string",
      "VolumeSizeInGB": number
   },
...
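For instance, a rough sketch of launching such a job with boto3, where VolumeSizeInGB sets the size of the EBS volume attached to each training instance (the job name, ECR image, role ARN, and bucket below are all placeholders):

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="my-dev-experiment-001",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/train/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceCount": 1,
        "InstanceType": "ml.m5.xlarge",
        "VolumeSizeInGB": 50,  # EBS volume attached to each training instance
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)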

Next, you need to decide what kind of models you want to train. If you are training your own models, you know how these models get their data, in terms of format, compression, source, and other factors that can impact the performance of loading that data into the model input. Alternatively, you can use the built-in algorithms that SageMaker has, which are optimized to process the protobuf RecordIO format. See more information here: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
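As an illustration, the sagemaker Python SDK can serialize numpy arrays into that protobuf RecordIO format (the toy data and bucket name below are made up):

import io
import numpy as np
import sagemaker.amazon.common as smac  # ships with the sagemaker Python SDK

# Toy data for illustration only.
features = np.random.rand(100, 10).astype("float32")
labels = np.random.randint(0, 2, size=100).astype("float32")

# Serialize into the protobuf RecordIO format the built-in algorithms expect.
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)
# `buf` can now be uploaded to S3, e.g.:
# boto3.resource("s3").Bucket("my-bucket").Object("train/data.pbr").upload_fileobj(buf)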

Another aspect that you can benefit from (or learn from, if you want to implement your own models in a more scalable and optimized way) is the TrainingInputMode (https://docs.aws.amazon.com/sagemaker/latest/dg/API_AlgorithmSpecification.html#SageMaker-Type-AlgorithmSpecification-TrainingInputMode):

Type: String

Valid Values: Pipe | File

Required: Yes

You can use the File mode to read the data files from S3. However, you can also use the Pipe mode, which opens up a lot of options to process data in a streaming fashion. This doesn't only mean real-time data from streaming services such as AWS Kinesis or Kafka; you can also read your data from S3 and stream it to the model, completely avoiding the need to store the data locally on the training instances.
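Inside a container running in Pipe mode, each channel arrives as a named pipe at /opt/ml/input/data/<channel>_<epoch> instead of as files on disk. A minimal sketch of consuming such a stream, assuming a channel named train (the chunked parsing below is just an illustration):

# Runs inside the training container. In Pipe mode, SageMaker streams the
# channel through a FIFO; a new one is created per pass (epoch) over the data.
path = "/opt/ml/input/data/train_0"  # epoch 0 of the "train" channel

with open(path, "rb") as fifo:
    while True:
        chunk = fifo.read(1024 * 1024)  # stream 1 MB at a time
        if not chunk:  # end of stream
            break
        # parse `chunk` and feed it into the model's input pipeline here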

