Can't train my Tensorflow detector model in Google Cloud


Problem description

I'm trying to train my own detector model based on the Tensorflow sample and this post. I did succeed in training locally on a MacBook Pro. The problem is that I don't have a GPU, and doing it on the CPU is too slow (about 25s per iteration).

So I'm trying to run it on Google Cloud ML Engine following the tutorial, but I can't make it run properly.

My folder structure is as follows:

+ data
  - train.record
  - test.record
+ models
  + train
  + eval
+ training
  - ssd_mobilenet_v1_coco
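
For reference, step 1 of the list below can be done with gsutil. A minimal sketch, assuming bucketname is a placeholder for the real bucket name:

    # create the bucket, then mirror the local tree into it
    gsutil mb gs://bucketname
    gsutil -m cp -r data models training gs://bucketname/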

My steps to change from local training to Google Cloud training were:

  1. Create a bucket in Google Cloud Storage and copy my local folder structure and files into it (see the gsutil sketch above);
  2. Edit my pipeline.config file and change all paths from Users/dev/detector/ to gs://bucketname/;
  3. Create a YAML file with the default configuration provided in the tutorial (sketched after the command below);
  4. Run:

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://bucketname/models/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-east1 \
    --config /Users/dev/detector/training/cloud.yml \
    -- \
    --train_dir=gs://bucketname/models/train \
    --pipeline_config_path=gs://bucketname/data/pipeline.config
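
For reference, the cloud.yml from step 3 holds the machine configuration for the job. A minimal sketch, assuming the tutorial's defaults of the time (a CUSTOM scale tier with GPU workers); the actual file isn't shown in the question and may differ:

    trainingInput:
      scaleTier: CUSTOM
      masterType: standard_gpu
      workerCount: 5
      workerType: standard_gpu
      parameterServerCount: 3
      parameterServerType: standard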

Running that command gives me the following error message from ML Engine:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'

Thanks.

Answer

The issue is the protobuf version. You probably installed the latest protoc via brew, and protobuf added the file field in version 3.5.0: https://github.com/google/protobuf/blob/9f80df026933901883da1d556b38292e14836612/CHANGES.txt#L74
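
To illustrate the mismatch: code generated by protoc >= 3.5.0 passes the new file keyword to the descriptor constructors, which an older protobuf runtime rejects. A hedged sketch of the generated pattern (the field name is illustrative, not the actual contents of preprocessor_pb2.py):

    # emitted by protoc >= 3.5.0 inside *_pb2.py modules
    _descriptor.FieldDescriptor(
        name='probability', full_name='...', index=0,
        number=1, type=2, cpp_type=6, label=1,
        has_default_value=False, default_value=float(0),
        message_type=None, enum_type=None, containing_type=None,
        is_extension=False, extension_scope=None,
        options=None, file=DESCRIPTOR),  # 'file' kwarg -> TypeError on protobuf < 3.5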

So, on top of the changes above, set the protobuf version in REQUIRED_PACKAGES to 'protobuf>=3.5.1'.
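
Concretely, that means editing the setup.py used to build object_detection-0.1.tar.gz and rebuilding the package. A minimal sketch, assuming a setup.py close to the one in the TensorFlow models repo (the other packages listed are illustrative):

    """Setup for the object_detection package uploaded to Cloud ML Engine."""
    from setuptools import find_packages, setup

    # Pin protobuf so the Cloud ML runtime understands the 'file' keyword
    # emitted by protoc >= 3.5.0.
    REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.5.1']

    setup(
        name='object_detection',
        version='0.1',
        install_requires=REQUIRED_PACKAGES,
        include_package_data=True,
        packages=[p for p in find_packages() if p.startswith('object_detection')],
        description='Tensorflow Object Detection Library',
    )

After editing, rebuild the tarball with python setup.py sdist (and the slim equivalent) and resubmit the job.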
