GCP AI Platform: Error when creating a custom predictor model version (trained PyTorch model + torchvision.transform)


Problem description

I am currently trying to deploy a custom model to AI Platform by following https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1. The model is based on a combination of a pre-trained model from PyTorch and torchvision.transform. I keep getting the error below, which appears to be related to the 500 MB constraint on custom prediction routines.

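ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.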

Setup.py

from pathlib import Path

from setuptools import setup

base = Path(__file__).parent

# Read one requirement per line, skipping blank lines.
with open(base / "requirements.txt") as f:
    REQUIRED_PACKAGES = [line.strip() for line in f if line.strip()]
print(f"\nPackages: {REQUIRED_PACKAGES}\n\n")

# [torch==1.3.0, torchvision==0.4.1, ImageHash==4.2.0,
#  Pillow==6.2.1, pyvis==0.1.8.2] installs 800 MB worth of files

setup(
    description="Extract features of a image",
    author='Amrit',
    name='test',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    project_urls={
        'Documentation': 'https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines#tensorflow',
        'Deploy': 'https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1',
        'Ai_platform troubleshooting': 'https://cloud.google.com/ai-platform/training/docs/troubleshooting',
        'Say Thanks!': 'https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43',
        'google Torch wheels': "http://storage.googleapis.com/cloud-ai-pytorch/readme.txt",
        'Torch & torchvision wheels': "https://download.pytorch.org/whl/torch_stable.html",
    },
    python_requires='~=3.7',
    # The predictor and preprocessing modules are shipped as scripts.
    scripts=['predictor.py', 'preproc.py'],
)

Steps taken: I tried adding torch and torchvision directly to the REQUIRED_PACKAGES list in setup.py so that PyTorch and torchvision are installed as dependencies at deployment time. My guess is that AI Platform internally downloads the PyPI package for PyTorch, which is over 500 MB, and that this causes the model deployment to fail. If I deploy the model with only torch, it seems to work (though it then, of course, throws an error because the torchvision library cannot be found).

File sizes

  • pytorch (torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl, about 111 MB)
  • torchvision (torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl, about 46 MB), downloaded from https://download.pytorch.org/whl/torch_stable.html and stored on GCS
  • The zipped predictor model file (.tar.gz format), which is the output of setup.py (5 KB)
  • A trained PyTorch model (44 MB)

In total, the model dependencies should be less than 250 MB, but I still keep getting this error. I have also tried using the torch and torchvision wheels from the Google-mirrored packages (http://storage.googleapis.com/cloud-ai-pytorch/readme.txt), but the same memory issue persists. AI Platform is quite new to us, and we would appreciate some input from professionals.
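As a rough check of the arithmetic above, the following sketch sums the sizes reported in the question and compares them with the 500 MB figure. The dictionary keys and the helper name total_size_mb are only illustrative. Note that wheels are compressed archives, so the installed footprint is larger than the download size; whether that accounts for the error here is an assumption, not something the error message confirms.

```python
# Sizes reported in the question, in MB (the 5 KB predictor package is ~0.005 MB).
ARTIFACT_SIZES_MB = {
    "torch-1.3.1+cpu wheel": 111,
    "torchvision-0.4.1+cpu wheel": 46,
    "predictor package (.tar.gz)": 0.005,
    "trained PyTorch model": 44,
}

# Constraint mentioned in the question for custom prediction routines.
LIMIT_MB = 500


def total_size_mb(sizes):
    """Return the summed size of all artifacts in MB."""
    return sum(sizes.values())


total = total_size_mb(ARTIFACT_SIZES_MB)
print(f"Total download size: {total:.3f} MB (limit: {LIMIT_MB} MB)")
print(f"Under the limit on paper: {total < LIMIT_MB}")
```

On paper the artifacts fit comfortably, which suggests the limit is being hit by what gets installed or loaded at deployment time rather than by the uploaded files themselves.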

GCP CLI input:

My environment variables:

BUCKET_NAME="something"
MODEL_DIR="gs://$BUCKET_NAME/"
VERSION_NAME='v6'
MODEL_NAME="something_model"
STAGING_BUCKET=$MODEL_DIR"staging_area/"
# TORCH_PACKAGE=$MODEL_DIR"package/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
# TORCHVISION_PACKAGE=$MODEL_DIR"package/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCH_PACKAGE="gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCHVISION_PACKAGE="gs://cloud-ai-pytorch/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
CUSTOM_CODE_PATH=$STAGING_BUCKET"imt_ai_predict-0.1.tar.gz"
PREDICTOR_CLASS="predictor.MyPredictor"
REGION='europe-west1'
MACHINE_TYPE='mls1-c4-m2'
 
gcloud beta ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --origin=$MODEL_DIR \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --machine-type=$MACHINE_TYPE \
  --package-uris=$CUSTOM_CODE_PATH,$TORCH_PACKAGE,$TORCHVISION_PACKAGE \
  --prediction-class=$PREDICTOR_CLASS

GCP CLI output:

 **[1] global**
 [2] asia-east1
 [3] asia-northeast1
 [4] asia-southeast1
 [5] australia-southeast1
 [6] europe-west1
 [7] europe-west2
 [8] europe-west3
 [9] europe-west4
 [10] northamerica-northeast1
 [11] us-central1
 [12] us-east1
 [13] us-east4
 [14] us-west1
 [15] cancel
Please enter your numeric choice:  1
 
To make this the default region, run `gcloud config set ai_platform/region global`.
 
Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......failed.                                                                                                                                            
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: **Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.**

My findings: I have found articles by people struggling in the same way with the PyTorch package who made it work by installing torch wheels from GCS (https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43). I have tried the same approach with torch and torchvision, but no luck so far, and I am waiting for a response from cloudml-feedback@google.com. Any help getting a torch + torchvision based custom predictor working on AI Platform would be great.

Recommended answer

I got this fixed by a combination of a few things. I stuck to the 4 GB CPU MLS1 machine and a custom prediction routine (<500 MB).

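  • Install the libraries through the setup.py install_requires parameter, but instead of passing only the package name and its version, add the correct torch wheel (ideally <100 MB).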
REQUIRED_PACKAGES = [line.strip() for line in open(base/"requirements.txt")] +\
['torchvision==0.5.0', 'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl']
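A minimal sketch of how that list could be assembled, assuming requirements.txt may still contain older torch or torchvision pins. The helper name build_requirements and the step that drops existing torch entries are illustrative assumptions, not part of the original answer; the wheel URL and the torchvision pin are the ones given above.

```python
import re

# Direct-URL requirement from the answer above (PEP 508 "name @ url" form).
TORCH_WHEEL = (
    "torch @ https://download.pytorch.org/whl/cpu/"
    "torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl"
)
PINNED_EXTRAS = ["torchvision==0.5.0", TORCH_WHEEL]


def build_requirements(lines, extras=PINNED_EXTRAS):
    """Combine requirements.txt lines with the pinned torch packages.

    Blank lines and comments are skipped. Existing torch/torchvision
    entries are dropped (an assumption of this sketch) so that pip does
    not see two conflicting pins for the same package.
    """
    cleaned = []
    for line in lines:
        req = line.strip()
        if not req or req.startswith("#"):
            continue
        # Take the distribution name before any version specifier or marker.
        name = re.split(r"[=<>!~@ \[;]", req, maxsplit=1)[0].lower()
        if name in {"torch", "torchvision"}:
            continue
        cleaned.append(req)
    return cleaned + list(extras)


# Example input mirroring the packages listed in the question's setup.py comment.
sample = ["torch==1.3.0", "torchvision==0.4.1", "", "ImageHash==4.2.0", "Pillow==6.2.1"]
for requirement in build_requirements(sample):
    print(requirement)
```

The returned list can be passed as install_requires in setup(). Pointing torch at the CPU-only wheel keeps pip from pulling the much larger default PyPI build.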
