Keras Google Cloud ML sample: IndexError
Problem description
I'm trying the keras cloudml sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras) and I can't get the cloud training to run. Local training, both with python and with gcloud, seems to go well.
I've looked for a solution on Stack Exchange and Google, and read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong indication that the fault is entirely mine!). In addition to the environment below, I've tried Python 3.6 and TensorFlow 1.3, with no success.
I'm a noob, so I'm probably making some basic mistake, but I cannot spot it.
Thanks for any help,
:-)
yarc68000.
----------- environment --------
(env1) $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(env1) $ conda list | grep 'h5py\|keras\|pandas\|numexpr\|tensorflow'
h5py 2.7.1 py27_1 conda-forge
keras 2.0.6 py27_0 conda-forge
numexpr 2.6.2 py27_1 conda-forge
pandas 0.20.3 py27_0 anaconda
tensorflow 1.2.1 <pip>
(env1) $ gcloud --version
Google Cloud SDK 172.0.1
alpha 2017.09.15
beta 2017.09.15
bq 2.0.26
core 2017.09.21
datalab 20170818
gcloud
gsutil 4.27
----------- job --------
(env1) $ export BUCKET=gs://j170922census1
(env1) $ gsutil mb $BUCKET
Creating gs://j170922census1/...
(env1) $ export TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv
(env1) $ export EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv
(env1) $ export JOB_NAME="census_keras_$$"
(env1) $ export TRAIN_STEPS=200
(env1) $ gcloud ml-engine jobs submit training $JOB_NAME --stream-logs --runtime-version 1.2 --job-dir $BUCKET --package-path trainer --module-name trainer.task --region us-central1 -- --train-files $TRAIN_FILE --eval-files $EVAL_FILE --train-steps $TRAIN_STEPS
Job [census_keras_7639] submitted successfully.
INFO 2017-09-22 19:56:56 +0200 service Validating job requirements...
INFO 2017-09-22 19:56:57 +0200 service Job creation request has been successfully validated.
INFO 2017-09-22 19:56:57 +0200 service Job census_keras_7639 is queued.
INFO 2017-09-22 19:56:57 +0200 service Waiting for job to be provisioned.
INFO 2017-09-22 20:01:39 +0200 service Waiting for TensorFlow to start.
INFO 2017-09-22 20:02:55 +0200 master-replica-0 Running task with arguments: --cluster={"master": ["master-cc38d44a51-0:2222"]} --task={"type": "master", "index": 0} --job={
<..>
INFO 2017-09-22 20:04:00 +0200 master-replica-0 197/200 [============================>.] - ETA: 0s - loss: 0.6931 - acc: 0.7563
INFO 2017-09-22 20:04:00 +0200 master-replica-0 200/200 [==============================] - 1s - loss: 0.6931 - acc: 0.7600
INFO 2017-09-22 20:04:00 +0200 master-replica-0 Epoch 10/20
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 Traceback (most recent call last):
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 "__main__", fname, loader, pkg_name)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 exec code in run_globals
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 199, in <module>
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 dispatch(**parse_args.__dict__)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 121, in dispatch
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks=callbacks)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/models.py", line 1110, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 initial_epoch=initial_epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1849, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks.on_epoch_begin(epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callback.on_epoch_begin(epoch, logs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 census_model = load_model(checkpoints[-1])
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 IndexError: list index out of range
<..>
INFO 2017-09-22 20:04:53 +0200 service Finished tearing down TensorFlow.
INFO 2017-09-22 20:05:49 +0200 service Job failed.
Recommended answer
There actually was a bug when running this sample on Cloud ML Engine: checkpoints are disabled for now on GCS, because Keras can't natively write checkpoints to GCS. With no checkpoint files to find, `checkpoints[-1]` in the traceback indexes an empty list, hence the IndexError. See this PR for the immediate fix for the issue you are facing. Also take a look at the pending PR, which fixes the checkpoint issue and makes the files available on GCS (a workaround for Keras's inability to write to GCS).
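To illustrate the kind of guard that avoids this error, here is a minimal sketch, not the code from either PR. It assumes checkpoints are looked up on the local filesystem with a `checkpoint.*` file pattern; the helper names `latest_checkpoint` and `evaluate_latest`, and the `load_fn` parameter (standing in for something like Keras's `load_model`), are hypothetical.

```python
import glob
import os


def latest_checkpoint(checkpoint_dir, pattern="checkpoint.*"):
    """Return the last checkpoint path (by sorted name), or None if none exist.

    Assumption: checkpoints are plain files in a local directory and their
    names sort in training order. The pattern is illustrative only.
    """
    matches = glob.glob(os.path.join(checkpoint_dir, pattern))
    if not matches:
        # An empty list is exactly the case where matches[-1] would raise
        # "IndexError: list index out of range".
        return None
    return sorted(matches)[-1]


def evaluate_latest(checkpoint_dir, load_fn):
    """Load the newest checkpoint with load_fn, or skip if there is none."""
    path = latest_checkpoint(checkpoint_dir)
    if path is None:
        print("No checkpoint found in %r; skipping evaluation." % checkpoint_dir)
        return None
    return load_fn(path)
```

In an evaluation callback, the idea would be to call something like `evaluate_latest(job_dir, load_model)` from `on_epoch_begin` and simply do nothing when it returns `None`, rather than indexing the checkpoint list unconditionally.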