Can't train my Tensorflow detector model in Google Cloud
Problem description
I'm trying to train my own detector model based on the Tensorflow sample and this post. I did succeed in training locally on a MacBook Pro. The problem is that I don't have a GPU, and doing it on the CPU is too slow (about 25s per iteration).
This way, I'm trying to run on Google Cloud ML Engine following the tutorial, but I can't make it run properly.
My folder structure is as follows:
+ data
  - train.record
  - test.record
+ models
  + train
  + eval
+ training
  - ssd_mobilenet_v1_coco
My steps to change from local training to Google Cloud training were:
- Create a bucket in Google Cloud storage and copy my local folder structure with files;
- Edit my pipeline.config file and change all paths from Users/dev/detector/ to gcc://bucketname/;
- Create a YAML file with the default configuration provided in the tutorial;
- Run:
gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://bucketname/models/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-east1 \
    --config /Users/dev/detector/training/cloud.yml \
    -- \
    --train_dir=gs://bucketname/models/train \
    --pipeline_config_path=gs://bucketname/data/pipeline.config
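For reference, the cloud.yml mentioned above follows the tutorial's default configuration; a sketch of what it contains (machine types and worker counts are the tutorial's defaults, not values I tuned myself):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
```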
Doing so gives me the following error message from the ML units:
The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR)
TypeError: __new__() got an unexpected keyword argument 'file'
Thank you.
Recommended answer
The issue is the protobuf version. You probably installed the latest protoc via brew, and protobuf added the file field in version 3.5.0: https://github.com/google/protobuf/blob/9f80df026933901883da1d556b38292e14836612/CHANGES.txt#L74
So, on top of the changes above, set the protobuf version in REQUIRED_PACKAGES to 'protobuf>=3.5.1'.
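Concretely, that means editing the REQUIRED_PACKAGES list in the object_detection package's setup.py before rebuilding the tar.gz that gets submitted to ML Engine; a minimal sketch (the Pillow entry is illustrative, only the protobuf pin matters here):

```python
# Sketch of the relevant part of object_detection's setup.py:
# pin protobuf so ML Engine installs a release whose generated
# *_pb2 descriptors accept the 'file' keyword argument.
REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.5.1']

# This list is passed to setuptools.setup(install_requires=REQUIRED_PACKAGES),
# after which the package is rebuilt (python setup.py sdist) and resubmitted.
pin = next(p for p in REQUIRED_PACKAGES if p.startswith('protobuf'))
assert pin == 'protobuf>=3.5.1'
```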