"Connection reset by peer" on adapted standard ML-Engine object-detection training

Problem description

My goal is to test a custom object-detection training job on Google ML-Engine, based on the pet-training example from the Object Detection API.

After some successful training steps (possibly up to the first checkpoint, since no checkpoint has been created yet) ...

15:46:56.784 global step 2257: loss = 0.7767 (1.70 sec/step)
15:46:56.821 global step 2258: loss = 1.3547 (1.13 sec/step)

... I received the following error in several object-detection training job attempts:

Error reported to Coordinator: , {"created":"@1502286418.246034567","description":"OS Error","errno":104,"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":229,"grpc_status":14,"os_error":"Connection reset by peer","syscall":"recvmsg"}

It appeared on worker-replica-0, 3 and 4. Afterwards the job failed:

Command '['python', '-m', u'object_detection.train', u'--train_dir=gs://cartrainingbucket/train', u'--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config', '--job-dir', u'gs://cartrainingbucket/train']' returned non-zero exit status -9
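For readers trying to interpret the gRPC error above, here is a minimal Python sketch that parses the logged payload and names its numeric fields. The helper `describe_grpc_os_error` and the two lookup tables are hypothetical and written by hand for illustration; they are not part of TensorFlow or gRPC. The errno name assumes Linux numbering.

```python
import json

# Payload copied from the worker log above.
payload = (
    '{"created":"@1502286418.246034567","description":"OS Error",'
    '"errno":104,"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c",'
    '"file_line":229,"grpc_status":14,'
    '"os_error":"Connection reset by peer","syscall":"recvmsg"}'
)

# Hand-written lookup tables (assumptions of this sketch, not a library API).
LINUX_ERRNO_NAMES = {104: "ECONNRESET"}   # Linux errno numbering
GRPC_STATUS_NAMES = {14: "UNAVAILABLE"}   # gRPC status code 14


def describe_grpc_os_error(raw):
    """Return the symbolic names behind the numeric fields of the error."""
    err = json.loads(raw)
    return {
        "errno": LINUX_ERRNO_NAMES.get(err["errno"], "unknown"),
        "grpc_status": GRPC_STATUS_NAMES.get(err["grpc_status"], "unknown"),
        "syscall": err["syscall"],
    }


summary = describe_grpc_os_error(payload)
print(summary)
```

Both names only say that the remote end of a connection went away while the worker was reading from it, so the message is probably a symptom of another replica dying rather than the root cause.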

I'm using an adaptation of faster_rcnn_resnet101.config with the following changes:

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_train.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
}

eval_config: {
  num_examples: 2000
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_val.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

My bucket looks like this:

cartrainingbucket (Regional US-CENTRAL1)
--data/
  --faster_rcnn_resnet101.config
  --vehicle_label_map.pbtxt
  --vehicle_train.record
  --vehicle_val.record
--train/ 
  --checkpoint
  --events.out.tfevents.1502259105.master-556a4f538e-0-tmt52
  --events.out.tfevents.1502264231.master-d3b4c71824-0-2733w
  --events.out.tfevents.1502267118.master-7f8d859ac5-0-r5h8s
  --events.out.tfevents.1502282824.master-acb4b4f78d-0-9d1mw
  --events.out.tfevents.1502285815.master-1ef3af1094-0-lh9dx
  --graph.pbtxt
  --model.ckpt-0.data-00000-of-00001
  --model.ckpt-0.index
  --model.ckpt-0.meta
  --packages/

I run the job with the following command (from a Windows cmd prompt, where ^ is the line-continuation character, equivalent to \ in a Unix shell):

gcloud ml-engine jobs submit training stefan_object_detection_09_08_2017i ^
--job-dir=gs://cartrainingbucket/train ^
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz ^
--module-name object_detection.train ^
--region us-central1 ^
--config object_detection/samples/cloud/cloud.yml ^
-- ^
--train_dir=gs://cartrainingbucket/train ^
--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config

The cloud.yml is the default one:

trainingInput:
  runtimeVersion: "1.0" # I also tried 1.2; in that case the failure appeared earlier in training
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

I'm using the latest TensorFlow Models master branch at the time of writing (commit 36203f09dc257569be2fef3a950ddb2ac25dddeb). My locally installed TF version is 1.2 and I'm using Python 3.5.1.

Both my training and validation records work for training locally.

As a newbie, I find it hard to see the source of the problem. I'd be grateful for any advice.

Recommended answer

Update: The job failed because it ran out of memory. Please try using a larger machine type instead.
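The "exit status -9" reported in the question is consistent with an out-of-memory kill. The failure message has the shape of the text produced by Python's subprocess module, which reports a child killed by signal N as return code -N on POSIX systems. Signal 9 is SIGKILL, which is what the Linux out-of-memory killer sends. A small sketch of that reading follows; the helper name `describe_exit_status` is hypothetical, and a POSIX platform is assumed.

```python
import signal


def describe_exit_status(returncode):
    """Explain a subprocess-style return code (hypothetical helper)."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        return "killed by " + signal.Signals(-returncode).name
    if returncode == 0:
        return "exited normally"
    return "exited with error code %d" % returncode


status = describe_exit_status(-9)
print(status)
```

SIGKILL by itself does not prove an out-of-memory condition, since anything can send it. Together with a job that dies partway through training, though, it is a reasonable first thing to check.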

In addition to rhaertel80's answer, it would also be helpful if you could share the project number and job ID with us via cloudml-feedback@google.com.
