Is there some kind of persistent local storage in AWS SageMaker model training?


Question

I did some experimentation with AWS SageMaker, and the download time of large datasets from S3 is very problematic, especially when the model is still in development and you want some kind of initial feedback relatively fast.

Is there some kind of local storage or other way to speed things up?

EDIT: I am referring to the batch training service, which allows you to submit a job as a Docker container.

While this service is intended for already-validated jobs that typically run for a long time (which makes the download time less significant), there's still a need for quick feedback:

  1. There's no other way to do the "integration" testing of your job with the SageMaker infrastructure (configuration files, data files, etc.).

  2. When experimenting with different variations of the model, it's important to be able to get initial feedback relatively fast.

Answer

SageMaker comprises a few distinct services, each optimized for a specific use case. If you are talking about the development environment, you are probably using the notebook service. A notebook instance comes with a local EBS volume (5GB) that you can copy some data into, so you can run fast development iterations without copying the data from S3 every time. The way to do it is by running wget or aws s3 cp from the notebook cells, or from the terminal that you can open from the directory list page.
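For example, here is a minimal sketch of pulling a sample of the dataset onto the notebook's local EBS volume with boto3 instead of the CLI (the bucket and key names below are hypothetical; /home/ec2-user/SageMaker is the EBS-backed directory on notebook instances):

    import boto3

    s3 = boto3.client("s3")

    # Download once; later development iterations read from the local
    # EBS volume instead of pulling the data from S3 every time.
    s3.download_file(
        Bucket="my-dataset-bucket",        # hypothetical bucket
        Key="datasets/train-sample.csv",   # hypothetical key
        Filename="/home/ec2-user/SageMaker/train-sample.csv",
    )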

Nevertheless, it is not recommended to copy too much data into the notebook instance, as that will cause your training and experiments to take too long. Instead, you should utilize the second part of SageMaker: the training service. Once you have a good sense of the model you want to train, based on quick iterations over small datasets on the notebook instance, you can point your model definition at larger datasets to be processed in parallel across a cluster of training instances. When you submit a training job, you can also define how much local storage each training instance will use, but you will benefit most from the distributed mode of training.

When you want to optimize your training job, you have a few options for storage. First, you can define the size of the EBS volume that you want your model to train on, for each one of the cluster instances. You specify it when you launch the training job (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html):

...
   "ResourceConfig": { 
      "InstanceCount": number,
      "InstanceType": "string",
      "VolumeKmsKeyId": "string",
      "VolumeSizeInGB": number
   },
...
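For illustration, here is a hedged boto3 sketch of where ResourceConfig (and VolumeSizeInGB in particular) fits into a CreateTrainingJob call; the job name, image URI, role ARN, and S3 paths are all hypothetical:

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_training_job(
        TrainingJobName="my-experiment-001",
        AlgorithmSpecification={
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
        InputDataConfig=[{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-dataset-bucket/datasets/train/",
                "S3DataDistributionType": "ShardedByS3Key",  # split the data across instances
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://my-dataset-bucket/output/"},
        ResourceConfig={
            "InstanceCount": 2,              # a small training cluster
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 100,           # local EBS storage per training instance
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )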

Next, you need to decide what kind of models you want to train. If you are training your own models, you know how they get their data, in terms of format, compression, source, and other factors that can impact the performance of loading that data into the model's input. If you prefer to use SageMaker's built-in algorithms, these are optimized to process the protobuf RecordIO format. See more information here: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
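As a concrete example, the SageMaker Python SDK can encode numpy arrays into that protobuf RecordIO format before you upload them to S3. A minimal sketch with toy data (the bucket and key are hypothetical):

    import io

    import boto3
    import numpy as np
    import sagemaker.amazon.common as smac

    # Toy feature matrix and labels standing in for a real dataset.
    X = np.random.rand(1000, 10).astype("float32")
    y = np.random.randint(0, 2, size=1000).astype("float32")

    # Encode the arrays as protobuf RecordIO records in memory.
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, X, y)
    buf.seek(0)

    # Upload the encoded records to S3 for a built-in algorithm to consume.
    boto3.client("s3").upload_fileobj(buf, "my-dataset-bucket", "datasets/train.recordio")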

TrainingInputMode

Type: String

Valid Values: Pipe | File

Required: Yes

You can use File mode to read the data files from S3. However, you can also use Pipe mode, which opens up a lot of options to process data in a streaming fashion. This doesn't only mean real-time data from streaming services such as AWS Kinesis or Kafka; you can also read your data from S3 and stream it to the model, completely avoiding the need to store the data locally on the training instances.
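To make that concrete: in Pipe mode, SageMaker exposes each channel's data inside the training container as a named pipe, /opt/ml/input/data/<channel>_<epoch>. A hedged sketch of a container-side reader (the channel name "train" is an assumption):

    # Read one epoch of a Pipe-mode channel as a stream of byte chunks,
    # without ever materializing the dataset on the local EBS volume.
    def stream_epoch(channel="train", epoch=0, chunk_size=1 << 20):
        fifo_path = f"/opt/ml/input/data/{channel}_{epoch}"
        with open(fifo_path, "rb") as fifo:
            while True:
                chunk = fifo.read(chunk_size)
                if not chunk:  # end of the stream for this epoch
                    break
                yield chunk    # feed the chunks into the model's input pipeline

    # Example: count the bytes streamed in the first epoch.
    total = sum(len(chunk) for chunk in stream_epoch(epoch=0))
    print(f"streamed {total} bytes from the train channel")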

