Cannot deploy trained model to Google Cloud AI Platform with custom prediction routine: Model requires more memory than allowed

Views: 125 Published: 2020/7/23 1:35:58 Tags: google-cloud-platform pytorch google-cloud-ml gcp-ai-platform-training

Question

I am trying to deploy a pretrained PyTorch model to AI Platform with a custom prediction routine. After following the instructions described here, the deployment fails with the following error:

ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.

The contents of the model folder total 83.89 MB, which is below the 250 MB limit described in the documentation. The only files in the folder are the model's checkpoint file (.pth) and the tarball required for the custom prediction routine.

The command used to create the model version:

gcloud beta ai-platform versions create pose_pytorch --model pose --runtime-version 1.15 --python-version 3.5 --origin gs://rcg-models/pytorch_pose_estimation --package-uris gs://rcg-models/pytorch_pose_estimation/my_custom_code-0.1.tar.gz --prediction-class predictor.MyPredictor

Changing the runtime version to 1.14 leads to the same error. I have also tried changing the --machine-type argument to mls1-c4-m2, as Parth suggested, but I still get the same error.

The setup.py file that generates my_custom_code-0.1.tar.gz looks like this:

from setuptools import setup

setup(
    name='my_custom_code',
    version='0.1',
    scripts=['predictor.py'],
    install_requires=["opencv-python", "torch"]
)

Relevant code snippet from the predictor:

    def __init__(self, model):
        """Stores artifacts for prediction. Only initialized via `from_path`.
        """
        self._model = model
        self._client = storage.Client()

    @classmethod
    def from_path(cls, model_dir):
        """Creates an instance of MyPredictor using the given path.

        This loads artifacts that have been copied from your model directory in
        Cloud Storage. MyPredictor uses them during prediction.

        Args:
            model_dir: The local directory that contains the model
                checkpoint (.pth). It is copied from the Cloud Storage
                model directory you provide when you deploy a version
                resource.

        Returns:
            An instance of `MyPredictor`.
        """

        net = PoseEstimationWithMobileNet()
        checkpoint_path = os.path.join(model_dir, "checkpoint_iter_370000.pth")
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        load_state(net, checkpoint)

        return cls(net)

Additionally, I have enabled logging for the model in AI Platform and get the following output:

2019-12-17T09:28:06.208537Z OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k 
2019-12-17T09:28:13.474653Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:48: The name tf.saved_model.tag_constants.SERVING is deprecated. Please use tf.saved_model.SERVING instead. 
2019-12-17T09:28:13.474680Z {"textPayload":"","insertId":"5df89fad00073e383ced472a","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474680Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:13.474807Z WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:50: The name tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY is deprecated. Please use tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY instead. 
2019-12-17T09:28:13.474829Z {"textPayload":"","insertId":"5df89fad00073ecd4836d6aa","resource":{"type":"cloudml_model_version","labels":{"project_id":"rcg-shopper","region":"","version_id":"lightweight_pose_pytorch","model_id":"pose"}},"timestamp":"2019-12-17T09:28:13.474829Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:13.474918Z WARNING:tensorflow: 
2019-12-17T09:28:13.474927Z The TensorFlow contrib module will not be included in TensorFlow 2.0. 
2019-12-17T09:28:13.474934Z For more information, please see: 
2019-12-17T09:28:13.474941Z   * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md 
2019-12-17T09:28:13.474951Z   * https://github.com/tensorflow/addons 
2019-12-17T09:28:13.474958Z   * https://github.com/tensorflow/io (for I/O related ops) 
2019-12-17T09:28:13.474964Z If you depend on functionality not listed there, please file an issue. 
2019-12-17T09:28:13.474999Z {"textPayload":"","insertId":"5df89fad00073f778735d7c3","resource":{"type":"cloudml_model_version","labels":{"version_id":"lightweight_pose_pytorch","model_id":"pose","project_id":"rcg-shopper","region":""}},"timestamp":"2019-12-17T09:28:13.474999Z","logName":"projects/rcg-shopper/logs/ml.googleapis… 
2019-12-17T09:28:15.283483Z ERROR:root:Failed to import GA GRPC module. This is OK if the runtime version is 1.x 
2019-12-17T09:28:16.890923Z Copying gs://cml-489210249453-1560169483791188/models/pose/lightweight_pose_pytorch/15316451609316207868/user_code/my_custom_code-0.1.tar.gz... 
2019-12-17T09:28:16.891150Z / [0 files][    0.0 B/  8.4 KiB]                                                 
2019-12-17T09:28:17.007684Z / [1 files][  8.4 KiB/  8.4 KiB]                                                 
2019-12-17T09:28:17.009154Z Operation completed over 1 objects/8.4 KiB.                                       
2019-12-17T09:28:18.953923Z Processing /tmp/custom_code/my_custom_code-0.1.tar.gz 
2019-12-17T09:28:19.808897Z Collecting opencv-python 
2019-12-17T09:28:19.868579Z   Downloading https://files.pythonhosted.org/packages/d8/38/60de02a4c9013b14478a3f681a62e003c7489d207160a4d7df8705a682e7/opencv_python-4.1.2.30-cp37-cp37m-manylinux1_x86_64.whl (28.3MB) 
2019-12-17T09:28:21.537989Z Collecting torch 
2019-12-17T09:28:21.552871Z   Downloading https://files.pythonhosted.org/packages/f9/34/2107f342d4493b7107a600ee16005b2870b5a0a5a165bdf5c5e7168a16a6/torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (734.6MB) 
2019-12-17T09:28:52.401619Z Collecting numpy>=1.14.5 
2019-12-17T09:28:52.412714Z   Downloading https://files.pythonhosted.org/packages/9b/af/4fc72f9d38e43b092e91e5b8cb9956d25b2e3ff8c75aed95df5569e4734e/numpy-1.17.4-cp37-cp37m-manylinux1_x86_64.whl (20.0MB) 
2019-12-17T09:28:53.550662Z Building wheels for collected packages: my-custom-code 
2019-12-17T09:28:53.550689Z   Building wheel for my-custom-code (setup.py): started 
2019-12-17T09:28:54.212558Z   Building wheel for my-custom-code (setup.py): finished with status 'done' 
2019-12-17T09:28:54.215365Z   Created wheel for my-custom-code: filename=my_custom_code-0.1-cp37-none-any.whl size=7791 sha256=fd9ecd472a6a24335fd24abe930a4e7d909e04bdc4cf770989143d92e7023f77 
2019-12-17T09:28:54.215482Z   Stored in directory: /tmp/pip-ephem-wheel-cache-i7sb0bmb/wheels/0d/6e/ba/bbee16521304fc5b017fa014665b9cae28da7943275a3e4b89 
2019-12-17T09:28:54.222017Z Successfully built my-custom-code 
2019-12-17T09:28:54.650218Z Installing collected packages: numpy, opencv-python, torch, my-custom-code

Recommended answer

This is a common problem and we understand it is a pain point. Please do the following:

torchvision has torch as a dependency and, by default, pulls torch from PyPI.

When you deploy the model, this happens even if you point to custom AI Platform torchvision packages, because torchvision as built by the PyTorch team is configured to use torch as a dependency. The torch package from PyPI is a roughly 720 MB file because it includes GPU support.
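The log in the question already shows this effect. As a rough illustration, the following sketch sums the download sizes that pip reports there. The three log lines are abbreviated copies of the ones in the question (the URLs are shortened to "..."), and the helper name `downloaded_megabytes` is made up for this example.

```python
import re

# Abbreviated copies of the pip "Downloading" lines from the question's log.
LOG_LINES = [
    "Downloading .../opencv_python-4.1.2.30-cp37-cp37m-manylinux1_x86_64.whl (28.3MB)",
    "Downloading .../torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (734.6MB)",
    "Downloading .../numpy-1.17.4-cp37-cp37m-manylinux1_x86_64.whl (20.0MB)",
]

# Captures the project name and the size in MB from a wheel download line.
SIZE_PATTERN = re.compile(r"/([A-Za-z0-9_]+)-[^/]*\.whl \((\d+(?:\.\d+)?)MB\)")


def downloaded_megabytes(lines):
    """Map each wheel's project name to the size pip reported, in MB."""
    sizes = {}
    for line in lines:
        match = SIZE_PATTERN.search(line)
        if match:
            sizes[match.group(1)] = float(match.group(2))
    return sizes


sizes = downloaded_megabytes(LOG_LINES)
total = sum(sizes.values())
# Print the largest downloads first, with their share of the total.
for name, megabytes in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {megabytes:7.1f} MB  ({megabytes / total:.0%} of downloads)")
```

The torch wheel is far larger than both the other dependencies and the 83.89 MB model folder, which is consistent with the point that the torch dependency, not the checkpoint, is what needs to shrink.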

To solve the torch dependency problem, build torchvision from source and tell it where to get torch from: point it at the PyTorch download site, where the package is smaller. Rebuild the torchvision binary using the direct references feature of PEP 440. In torchvision's setup.py we have:

pytorch_dep = 'torch'
if os.getenv('PYTORCH_VERSION'):
    pytorch_dep += "==" + os.getenv('PYTORCH_VERSION')

Update setup.py in torchvision to use the direct references feature:

requirements = [
     #'numpy',
     #'six',
     #pytorch_dep,
     'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl'
]
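If you would rather not hard-code the wheel URL, one option is to choose the requirement string from the environment, in the same spirit as the existing `PYTORCH_VERSION` check. This is only a sketch: the `TORCH_WHEEL_URL` variable and the `torch_requirement` helper are hypothetical and are not part of torchvision's setup.py.

```python
import os


def torch_requirement(env=None):
    """Build the torch entry for torchvision's `requirements` list."""
    env = os.environ if env is None else env
    wheel_url = env.get("TORCH_WHEEL_URL")  # hypothetical variable name
    if wheel_url:
        # PEP 440 direct reference: "<name> @ <url>"
        return "torch @ " + wheel_url
    version = env.get("PYTORCH_VERSION")  # variable already read by torchvision
    if version:
        return "torch==" + version
    return "torch"


# The CPU wheel URL quoted in the answer.
cpu_wheel = "https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl"
requirements = [torch_requirement({"TORCH_WHEEL_URL": cpu_wheel})]
print(requirements)
```

Passing a plain dict instead of `os.environ` keeps the helper easy to check without touching the real environment.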

I have already done this for you and built 3 wheel files you can use:

gs://dpe-sandbox/torchvision-0.4.0-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.0)
gs://dpe-sandbox/torchvision-0.4.2-cp37-cp37m-linux_x86_64.whl (torch 1.2.0, vision 0.4.2)
gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl (torch 1.4.0, vision 0.5.0)

These torchvision packages get torch from the PyTorch site instead of PyPI (for example, https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl).

Update your model's setup.py when deploying the model to AI Platform so that it does not include torch or torchvision.
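Applied to the setup.py from the question, that step amounts to dropping torch from `install_requires`, since the torchvision wheel passed through `--package-uris` brings in its own build of torch. A minimal sketch, assuming no other dependency of your code pulls torch back in; the variable names are illustrative only.

```python
# install_requires as it appears in the question's setup.py
original_requires = ["opencv-python", "torch"]

# Packages supplied through the torchvision wheel listed in --package-uris
provided_by_wheel = {"torch", "torchvision"}

# Keep only the dependencies that pip still has to install itself.
install_requires = [pkg for pkg in original_requires if pkg not in provided_by_wheel]
print(install_requires)
```

The resulting list is what you would pass as `install_requires` in the `setup()` call.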

Redeploy the model as follows:

PYTORCH_VISION_PACKAGE=gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl

gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
            --origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
            --python-version=3.7 \
            --runtime-version={RUNTIME_VERSION} \
            --machine-type=mls1-c4-m4 \
            --package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI},{PYTORCH_VISION_PACKAGE}\
            --prediction-class={MODEL_CLASS}

You can change PYTORCH_VISION_PACKAGE to any of the torchvision wheels listed above.
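For scripted deployments, the same command can be assembled in Python and handed to `subprocess.run`. The sketch below only builds the argument list and uses the flags from the command above. The function name and the bucket, package, and runtime-version values in the example call are placeholders, not values from a real project.

```python
import shlex


def build_create_version_command(version, model, origin, runtime_version,
                                 package_uris, prediction_class,
                                 python_version="3.7", machine_type="mls1-c4-m4"):
    """Return the gcloud argument list for creating a model version."""
    return [
        "gcloud", "beta", "ai-platform", "versions", "create", version,
        "--model=" + model,
        "--origin=" + origin,
        "--python-version=" + python_version,
        "--runtime-version=" + runtime_version,
        "--machine-type=" + machine_type,
        # The package URIs are passed as one comma-separated value.
        "--package-uris=" + ",".join(package_uris),
        "--prediction-class=" + prediction_class,
    ]


command = build_create_version_command(
    version="pose_pytorch",
    model="pose",
    origin="gs://example-bucket/pytorch_pose_estimation",  # placeholder bucket
    runtime_version="1.15",  # placeholder; use the version you deploy with
    package_uris=[
        "gs://example-bucket/pytorch_pose_estimation/my_custom_code-0.1.tar.gz",  # placeholder
        "gs://dpe-sandbox/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl",
    ],
    prediction_class="predictor.MyPredictor",
)
# Print a shell-safe rendering for review before running anything.
print(" ".join(shlex.quote(arg) for arg in command))
# To execute it: subprocess.run(command, check=True)
```

Building a list rather than a single string avoids shell quoting issues and makes it easy to swap in a different torchvision wheel.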
