Training Failed - AWS Machine Learning

Problem description

I am working on AWS Machine Learning with a MERN (MongoDB, Express, React, Node.js) stack. When I upload a data file (.csv) for processing, the training job fails after some time with the following TrainingFailed error:

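AlgorithmError: CannotStartContainerError. Please make sure the container can be run with 'docker run <image> train'. Please refer to SageMaker documentation for details. It is possible that the Dockerfile's entrypoint is not properly defined, or permissions may be missing.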

I also configured the following settings in my AWS account.

I also granted the following permissions in my AWS account:

After applying all of these settings and permissions, I also added all the keys to the MongoDB configuration settings, but I cannot work out what else the machine learning process needs. Training never completes and no model artifacts appear in the S3 bucket. It looks as if the SageMaker process never starts. Can anyone help me with this?

My Dockerfile, stored in the project folder under the name Dockerfile:

FROM ubuntu
RUN apt-get update
RUN apt-get install curl -y
RUN curl -sL https://deb.nodesource.com/setup_10.x -o nodesource_setup.sh
RUN bash nodesource_setup.sh
RUN apt install nodejs -y
WORKDIR /usr/app
COPY . /usr/app/
RUN npm install
EXPOSE 3000
ENTRYPOINT [ "python3.7", "/opt/ml/code/train.py" ]

I also set up code images on Docker Hub for the SageMaker linear learner and XGBoost, and created repositories in ECR on AWS.

I also copied train.py to opt/ml/code/train.py on AWS and got the output /home/ec2-user/SageMaker/docker_test_folder, but I still get this error.

Recommended answer

The error means that SageMaker is not able to launch your Docker image, because the entry point is not defined correctly. You can take a look at my repo. Basically, in your Dockerfile you have to install the packages you need, create a folder (say, /opt/ml/code), and put your training script in that folder; this is the script that will be invoked as train. The train file should follow some conventions that you can read about here.
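To make the conventions mentioned above more concrete, here is a minimal sketch of what a train script might look like. It is only an illustration, not the code from the linked repo. It relies on the standard SageMaker container paths (/opt/ml/input/data/<channel>, /opt/ml/input/config/hyperparameters.json, /opt/ml/model, /opt/ml/output/failure). The channel name training, the function names run_training and main, and the output file model.json are assumptions made for this example, and the "training" step is just a placeholder that counts CSV rows.

```python
#!/usr/bin/env python3
"""Minimal sketch of a SageMaker-style `train` entry point (illustrative only)."""
import csv
import json
import sys
import traceback
from pathlib import Path


def run_training(prefix="/opt/ml"):
    """Read the 'training' channel under `prefix` and write a placeholder model."""
    base = Path(prefix)
    # The channel name "training" is an assumption; use the name given in the job config.
    data_dir = base / "input" / "data" / "training"
    hp_file = base / "input" / "config" / "hyperparameters.json"
    model_dir = base / "model"

    # Hyperparameters are provided as a JSON object; the file may be absent locally.
    hyperparams = json.loads(hp_file.read_text()) if hp_file.exists() else {}

    # Placeholder for real training: count every CSV row (header lines included).
    n_rows = 0
    for csv_path in sorted(data_dir.glob("*.csv")):
        with csv_path.open(newline="") as f:
            n_rows += sum(1 for _ in csv.reader(f))
    if n_rows == 0:
        raise ValueError(f"no CSV rows found in {data_dir}")

    # Anything written to the model directory is packaged and uploaded to S3 by SageMaker.
    model_dir.mkdir(parents=True, exist_ok=True)
    model_path = model_dir / "model.json"  # hypothetical artifact name
    model_path.write_text(
        json.dumps({"rows_seen": n_rows, "hyperparameters": hyperparams})
    )
    return model_path


def main(prefix="/opt/ml"):
    """Return 0 on success; on error, write the traceback to output/failure and return 1."""
    try:
        run_training(prefix)
        return 0
    except Exception:
        out_dir = Path(prefix) / "output"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "failure").write_text(traceback.format_exc())
        return 1


if __name__ == "__main__":
    sys.exit(main())
```

For the container to start, the interpreter named in the ENTRYPOINT has to exist inside the image and the script has to exist at the path the ENTRYPOINT points to. The Dockerfile in the question does not appear to install Python, and it copies the project to /usr/app rather than /opt/ml/code, so either of these could explain the CannotStartContainerError. Running the image locally with the train argument, as the error message suggests, is a quick way to check before submitting another training job.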
