Submitting TensorFlow estimator as run to experiment


Question

Hello.

I've created a training script for a model and I've run it on a compute cluster in Azure ML Services.

It works fine. However, I'm now trying to move the exact same setup to another Azure subscription. For some reason, nothing happens when I submit the run. Are there any prerequisites or permissions that need to be in place in order to submit runs? I can create the compute cluster without problems.

I'm using the Python SDK with the following code:

from azureml.core.workspace import Workspace
import azureml.core
import os

ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "gpucluster3"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           min_nodes=0,
                                                           max_nodes=2)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

print(compute_target.get_status().serialize())

from azureml.core import Experiment

experiment_name = 'test'
experiment = Experiment(ws, name=experiment_name)

from azureml.train.dnn import TensorFlow
script_params={'--data_dir': ds_data.as_mount()}
# Not all definitions are included here (e.g. ds_data, project_folder), but they are defined
estimator= TensorFlow(source_directory=project_folder,
                      compute_target=compute_target,
                      script_params=script_params,
                      entry_script='train-hov.py',
                      pip_packages=['keras==2.1.2','h5py'],
                      node_count=2,
                      process_count_per_node=1,
                      distributed_backend='mpi',
                      use_gpu=True)
run = experiment.submit(estimator)

When I run the script, it gets stuck at the last line and the run is never submitted to the experiment.
Is there a way for me to debug this? Or are there resource providers or similar settings that need to be configured?

I hope you can help.

Recommended answer

Hello,

Since you are re-creating the experiment in the new subscription, I would suggest checking the following:

1. Is the "STANDARD_NC6" series available in this subscription? The new workspace may be in a different region, and Azure publishes a list of the regions where this series is available.

2. Check the subscription type of the new subscription. If it is a free trial, you might not have access to all resources.

3. Check whether the datastore mount succeeds in your container. You can check this in the container logs in the Azure portal. A similar issue has been reported where the run failed because an external datastore could not be mounted.

4. Also check that all parameters to the TensorFlow class are passed correctly.
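
The first check can be scripted from the same SDK. The following is a minimal sketch, not a definitive diagnostic: it assumes that `AmlCompute.supported_vmsizes` returns a list of dictionaries with a `name` key, and the helper names `find_vm_size` and `check_workspace` are hypothetical, introduced only for illustration. Turning on debug logging before submitting may also show where a stuck `experiment.submit` call is waiting.

```python
import logging


def find_vm_size(supported_sizes, wanted):
    """Return the entry whose 'name' matches `wanted` (case-insensitive), or None.

    `supported_sizes` is assumed to be a list of dicts with a 'name' key,
    as returned by AmlCompute.supported_vmsizes.
    """
    wanted = wanted.lower()
    for size in supported_sizes:
        if size.get("name", "").lower() == wanted:
            return size
    return None


def check_workspace(vm_size="STANDARD_NC6"):
    """Hypothetical helper: report whether `vm_size` is offered to the workspace."""
    # Imported here so find_vm_size can be used without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.compute import AmlCompute

    # Verbose logging can reveal which request a hanging call is waiting on.
    logging.basicConfig(level=logging.DEBUG)

    ws = Workspace.from_config()
    sizes = AmlCompute.supported_vmsizes(workspace=ws)
    match = find_vm_size(sizes, vm_size)
    if match is None:
        print(f"{vm_size} is not offered in region {ws.location}")
    else:
        print(f"{vm_size} is available in region {ws.location}: {match}")
    return match
```

Calling `check_workspace()` in the new subscription before creating the cluster would show whether the VM size is the problem; if it is listed, quota and the subscription type (point 2) are the next things to look at.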

