Unable to specify master_type in MLEngineTrainingOperator
Problem description
I am using Airflow to schedule a pipeline that trains a scikit-learn model on AI Platform. I use this DAG to train it:
with models.DAG(JOB_NAME,
                schedule_interval=None,
                default_args=default_args) as dag:
    # Tasks definition
    training_op = MLEngineTrainingOperator(
        task_id='submit_job_for_training',
        project_id=PROJECT,
        job_id=job_id,
        package_uris=[os.path.join(TRAINER_BIN)],
        training_python_module=TRAINER_MODULE,
        runtime_version=RUNTIME_VERSION,
        region='europe-west1',
        training_args=[
            '--base-dir={}'.format(BASE_DIR),
            '--event-date=20200212',
        ],
        python_version='3.5')
    training_op
The training package loads the desired CSV files and trains a RandomForestClassifier on them.
This works fine until the number and size of the files increase. Then I get this error:
ERROR - The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL). To find out more about why your job exited please check the logs:
The total size of the files is around 4 GB. I don't know what the default machine is, but it seems insufficient. Hoping this would solve the memory consumption issue, I tried changing the classifier's n_jobs parameter from -1 to 1, with no luck.
Looking at the code of MLEngineTrainingOperator and the documentation, I added a custom scale_tier and a master_type of n1-highmem-8 (8 vCPUs, 52 GB of RAM), like this:
with models.DAG(JOB_NAME,
                schedule_interval=None,
                default_args=default_args) as dag:
    # Tasks definition
    training_op = MLEngineTrainingOperator(
        task_id='submit_job_for_training',
        project_id=PROJECT,
        job_id=job_id,
        package_uris=[os.path.join(TRAINER_BIN)],
        training_python_module=TRAINER_MODULE,
        runtime_version=RUNTIME_VERSION,
        region='europe-west1',
        master_type="n1-highmem-8",
        scale_tier="custom",
        training_args=[
            '--base-dir={}'.format(BASE_DIR),
            '--event-date=20200116',
        ],
        python_version='3.5')
    training_op
This resulted in another error:
ERROR - <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/MY_PROJECT/jobs?alt=json returned "Field: master_type Error: Master type must be specified for the CUSTOM scale tier.">
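The error comes from the underlying jobs.submit request: when the scale tier is CUSTOM, the machine type must accompany it in the trainingInput body. A minimal sketch of what that payload should look like (field names per the Cloud AI Platform Training jobs API; the bucket path and values are placeholders):

```python
# trainingInput body expected by the AI Platform jobs.submit API when
# using the CUSTOM scale tier: masterType is mandatory alongside it.
# Values below are placeholders, not taken from the question.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-highmem-8",   # must be present with CUSTOM
    "packageUris": ["gs://my-bucket/package.tar.gz"],
    "pythonModule": "trainer.train",
    "region": "europe-west1",
    "runtimeVersion": "1.14",
    "pythonVersion": "3.5",
}
```

The 400 above indicates that the operator never forwarded master_type, so the API received a CUSTOM tier with no masterType field.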
I don't know what is wrong, but it appears that is not the way to do it.
Using the command line, I managed to launch the job:
gcloud ai-platform jobs submit training training_job_name --packages=gs://path/to/package/package.tar.gz --python-version=3.5 --region=europe-west1 --runtime-version=1.14 --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16
However, I would like to do this in Airflow.
Any help would be greatly appreciated.
Answer
My environment used an old version of Apache Airflow, 1.10.3, where the master_type argument was not present. Updating to version 1.10.6 solved the issue.
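Before relying on master_type, it is worth verifying that the installed Airflow is at least 1.10.6. A small sketch of such a check (the helper names are hypothetical, not part of Airflow; the installed version string would come from `airflow.version.version`):

```python
# Compare a dotted version string against the 1.10.6 threshold from the
# answer above. A plain tuple comparison is enough for release versions.
MIN_VERSION = "1.10.6"  # first release where MLEngineTrainingOperator
                        # accepts master_type

def parse_version(v):
    """Turn '1.10.3' into (1, 10, 3) for lexicographic comparison."""
    return tuple(int(part) for part in v.split("."))

def supports_master_type(installed_version):
    """True if this Airflow release forwards master_type to the API."""
    return parse_version(installed_version) >= parse_version(MIN_VERSION)
```

For example, `supports_master_type("1.10.3")` is False, which matches the silently dropped master_type seen in the question.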