GCP AI Platform: Error when creating a custom predictor model version (trained PyTorch model + torchvision.transform)
Question
I am currently trying to deploy a custom model to AI Platform by following https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1. It is based on a combination of a pre-trained model from PyTorch and torchvision.transform. I keep getting the error below, which appears to be related to the 500MB constraint on custom prediction.
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.
setup.py
from setuptools import setup
from pathlib import Path
base = Path(__file__).parent
REQUIRED_PACKAGES = [line.strip() for line in open(base/"requirements.txt")]
print(f"\nPackages: {REQUIRED_PACKAGES}\n\n")
# [torch==1.3.0,torchvision==0.4.1, ImageHash==4.2.0
# Pillow==6.2.1,pyvis==0.1.8.2] installs 800mb worth of files
setup(
    name='test',
    version='0.1',
    description="Extract features of an image",
    author='Amrit',
    install_requires=REQUIRED_PACKAGES,
    project_urls={
        'Documentation': 'https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines#tensorflow',
        'Deploy': 'https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1',
        'Ai_platform troubleshooting': 'https://cloud.google.com/ai-platform/training/docs/troubleshooting',
        'Say Thanks!': 'https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43',
        'google Torch wheels': 'http://storage.googleapis.com/cloud-ai-pytorch/readme.txt',
        'Torch & torchvision wheels': 'https://download.pytorch.org/whl/torch_stable.html',
    },
    python_requires='~=3.7',
    scripts=['predictor.py', 'preproc.py'])
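As an aside, the one-line requirements parse above will also pick up blank lines and `#` comments from requirements.txt, which breaks `install_requires`. A slightly more defensive sketch (same behaviour otherwise; the helper name is mine, not from the original code):

```python
from pathlib import Path

def read_requirements(path):
    """Return requirement specifiers, skipping blank lines and # comments."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
```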
Steps taken: I tried adding 'torch' and 'torchvision' directly to the REQUIRED_PACKAGES list in setup.py, to provide PyTorch + torchvision as dependencies installed at deployment time. I am guessing AI Platform internally downloads the PyPI package for PyTorch, which is 500+ MB, and this causes the model deployment to fail. If I deploy the model with 'torch' alone it seems to work (and of course throws an error for not being able to find the 'torchvision' library).
File sizes:
- pytorch (torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl about 111MB)
- torchvision (torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl, about 46MB) from https://download.pytorch.org/whl/torch_stable.html, stored on GCS
- The zipped predictor model file (.tar.gz format), the output of setup.py (5KB)
- A trained PyTorch model (size 44MB)
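The "less than 250MB" claim below is just the sum of the artifact sizes listed above (figures rounded from the list, not new measurements):

```python
# Approximate sizes of the deployment artifacts, in MB
sizes_mb = {
    "torch CPU wheel": 111,
    "torchvision CPU wheel": 46,
    "predictor tarball (setup.py output)": 0.005,
    "trained PyTorch model": 44,
}
total = sum(sizes_mb.values())
print(f"total = {total:.1f} MB")  # comfortably under 250MB
```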
In total, the model dependencies should be less than 250MB, but I still keep getting this error. I have also tried using the torch and torchvision wheels from Google's mirrored packages (http://storage.googleapis.com/cloud-ai-pytorch/readme.txt), but the same memory issue persists. AI Platform is quite new to us and we would like some input from professionals.
GCP CLI input:
My environment variables:
BUCKET_NAME= "something"
MODEL_DIR="gs://$BUCKET_NAME/"
VERSION_NAME='v6'
MODEL_NAME="something_model"
STAGING_BUCKET=$MODEL_DIR"staging_area/"
# TORCH_PACKAGE=$MODEL_DIR"package/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
# TORCHVISION_PACKAGE=$MODEL_DIR"package/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCH_PACKAGE="gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCHVISION_PACKAGE="gs://cloud-ai-pytorch/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
CUSTOM_CODE_PATH=$STAGING_BUCKET"imt_ai_predict-0.1.tar.gz"
PREDICTOR_CLASS="predictor.MyPredictor"
REGION='europe-west1'
MACHINE_TYPE='mls1-c4-m2'
gcloud beta ai-platform versions create $VERSION_NAME \
--model=$MODEL_NAME \
--origin=$MODEL_DIR \
--runtime-version=2.3 \
--python-version=3.7 \
--machine-type=$MACHINE_TYPE \
--package-uris=$CUSTOM_CODE_PATH,$TORCH_PACKAGE,$TORCHVISION_PACKAGE \
--prediction-class=$PREDICTOR_CLASS
GCP CLI output:
**[1] global**
[2] asia-east1
[3] asia-northeast1
[4] asia-southeast1
[5] australia-southeast1
[6] europe-west1
[7] europe-west2
[8] europe-west3
[9] europe-west4
[10] northamerica-northeast1
[11] us-central1
[12] us-east1
[13] us-east4
[14] us-west1
[15] cancel
Please enter your numeric choice: 1
To make this the default region, run `gcloud config set ai_platform/region global`.
Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: **Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.**
My finding: I have found articles of people struggling the same way with the PyTorch package who made it work by hosting the torch wheels on GCS (https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43). I have tried the same approach with torch and torchvision, but no luck so far, and I am waiting for a response from cloudml-feedback@google.com. Any help getting a custom torch + torchvision based predictor working on AI Platform would be great.
Answer
Got this fixed by a combination of a few things. I stuck to the 4GB CPU MlS1 machine and a custom predictor routine (<500MB).
- Install the libraries via the setup.py install_requires argument, not just by listing package names and versions, but by pointing directly at the correct torch wheel (preferably <100MB):
REQUIRED_PACKAGES = [line.strip() for line in open(base/"requirements.txt")] +\
['torchvision==0.5.0', 'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl']
- I reduced the preprocessing steps. Everything would not fit in memory at once, so JSON-serialize the data you send from preproc.py to the predictor class and load it back there:
import json
# in preproc.py: dump the data to hand off to the predictor class
json.dump(data_for_predictor, open("payload.json", "w"))
- Import only the specific functions you need from the required libraries, instead of the whole library:
from torch import zeros, load  # instead of `import torch`
# your code
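For context, the PREDICTOR_CLASS from the question (predictor.MyPredictor) has to follow the AI Platform custom prediction routine interface (a class exposing `predict` and a `from_path` factory). A minimal sketch applying the selective-import advice above; the model filename model.pt and the tensor handling are assumptions for illustration, not the actual predictor code:

```python
import os
# Import only what is needed, per the advice above
from torch import load, no_grad, tensor

class MyPredictor:
    """Minimal AI Platform custom prediction routine sketch."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # instances: a list of JSON-deserialized inputs from the request body
        inputs = tensor(instances)
        with no_grad():
            outputs = self._model(inputs)
        return outputs.tolist()

    @classmethod
    def from_path(cls, model_dir):
        # AI Platform calls this with the local copy of the model directory
        model = load(os.path.join(model_dir, "model.pt"), map_location="cpu")
        model.eval()
        return cls(model)
```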
[Important]
Haven't tested different types of serialized objects for the trained model; there could be a difference there as well as to which one (torch.save, pickle, joblib, etc.) is most memory-saving.
Found this link for those whose organization is a GCP partner; they might be able to request more quota (I'm guessing from 500MB to 2GB or so). I didn't have to go in this direction as my issue was resolved (and other ones popped up, lol). https://cloud.google.com/ai-platform/training/docs/quotas