How can I use TensorBoard with AWS SageMaker TensorFlow?


Question

I have started a SageMaker job:

from sagemaker.tensorflow import TensorFlow
mytraining= TensorFlow(entry_point='model.py',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.p2.xlarge',
                        framework_version='2.0.0',
                        py_version='py3',
                        distributions={'parameter_server': {'enabled':False}})

training_data_uri ='s3://path/to/my/data'
mytraining.fit(training_data_uri,run_tensorboard_locally=True)

Using run_tensorboard_locally=True gave me:

Tensorboard is not supported with script mode. You can run the following command: tensorboard --logdir None --host localhost --port 6006 This can be run from anywhere with access to the S3 URI used as the logdir.

It seems I can't use it in script mode, but can I access the TensorBoard logs in S3? And where are the logs in S3?

import argparse
import json
import os

def _parse_args():
    parser = argparse.ArgumentParser()

    # Data, model, and output directories
    # model_dir is always passed in from SageMaker. By default this is a S3 path under the default bucket.
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS')))
    parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST'))

    return parser.parse_known_args()

if __name__ == "__main__":
    args, unknown = _parse_args()

    train_data, train_labels = load_training_data(args.train)
    eval_data, eval_labels = load_testing_data(args.train)

    mymodel= model(train_data, train_labels, eval_data, eval_labels)

    if args.current_host == args.hosts[0]:
        mymodel.save(os.path.join(args.sm_model_dir, '000000002/model.h5'))

A similar question is here: stack

EDIT: I tried this new config, but it doesn't work.

from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(s3_output_path='s3://PATH/to/my/bucket')

mytraining= TensorFlow(entry_point='model.py',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.p2.xlarge',
                        framework_version='2.0.0',
                        py_version='py3',
                        distributions={'parameter_server': {'enabled':False}},
                        tensorboard_output_config=tensorboard_output_config)

I added the callback to my model.py script; it is the same one I use without SageMaker. As the log dir I set the default directory where TensorBoardOutputConfig writes its data, but it doesn't work (docs). I also tried it without the callback.

tensorboardCallback = tf.keras.callbacks.TensorBoard(
    log_dir='/opt/ml/output/tensorboard',
    histogram_freq=0,
    # batch_size=32,  ignored in TF 2.0
    write_graph=True,
    write_grads=False,
    write_images=False,
    embeddings_freq=0,
    embeddings_layer_names=None,
    embeddings_metadata=None,
    embeddings_data=None,
    update_freq='batch')
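
Because the same callback is used both with and without SageMaker, the log directory could be chosen at runtime. The following is only a minimal sketch: `resolve_log_dir` is a hypothetical helper, and checking `SM_CURRENT_HOST` to detect SageMaker is an assumption based on the `SM_*` variables the script already reads. The path `/opt/ml/output/tensorboard` is taken from the callback above.

import os

# Container path used in the callback above; assumed to be the directory
# that TensorBoardOutputConfig uploads to its s3_output_path.
SAGEMAKER_TENSORBOARD_DIR = '/opt/ml/output/tensorboard'


def resolve_log_dir(local_default='./logs', environ=None):
    """Return the TensorBoard log directory for the current environment.

    Hypothetical helper: treats a non-empty SM_CURRENT_HOST as a sign
    that the script is running inside a SageMaker training container.
    """
    env = os.environ if environ is None else environ
    if env.get('SM_CURRENT_HOST'):
        return SAGEMAKER_TENSORBOARD_DIR
    return local_default


log_dir = resolve_log_dir()
# log_dir would then be passed to the callback, e.g.
# tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq='batch')

Outside SageMaker the helper falls back to a local folder, so the callback definition itself does not need to change between environments.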

Recommended answer

It is difficult to pin down the exact root cause in your case, but the following steps worked for me. I started TensorBoard manually inside the notebook instance.

  1. Follow the guide on SageMaker debugging to configure the S3 output path for the TensorBoard logs.

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.tensorflow import TensorFlow

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://bucket-name/tensorboard_log_folder/'
)

estimator = TensorFlow(entry_point='train.py',
               source_dir='./',
               model_dir=model_dir,
               output_path= output_dir,
               train_instance_type=train_instance_type,
               train_instance_count=1,
               hyperparameters=hyperparameters,
               role=sagemaker.get_execution_role(),
               base_job_name='Testing-TrainingJob',
               framework_version='2.2',
               py_version='py37',
               script_mode=True,
               tensorboard_output_config=tensorboard_output_config)

estimator.fit(inputs)

  • Start TensorBoard from a terminal on the notebook instance, using the S3 location configured above.

    $ tensorboard --logdir 's3://bucket-name/tensorboard_log_folder/'
    

  • Access the board via the notebook URL with /proxy/6006/ appended. Replace the notebook instance details in the following URL with your own.

    https://myinstance.notebook.us-east-1.sagemaker.aws/proxy/6006/
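
The last two steps can be sketched in Python. This is only an illustration: `tensorboard_command` and `proxy_url` are hypothetical helpers, the URL pattern is copied from the example URL above, and port 6006 is assumed because it appears both in that URL and in the warning message from the question.

```python
import shlex


def tensorboard_command(logdir, port=6006):
    """Build the argument list for launching TensorBoard on an S3 log dir."""
    return ['tensorboard', '--logdir', logdir, '--port', str(port)]


def proxy_url(instance_name, region, port=6006):
    """Build the notebook proxy URL, following the pattern shown above."""
    return f'https://{instance_name}.notebook.{region}.sagemaker.aws/proxy/{port}/'


cmd = tensorboard_command('s3://bucket-name/tensorboard_log_folder/')
# Print a copy-pasteable shell command for the notebook terminal.
print(' '.join(shlex.quote(part) for part in cmd))
# Print the URL to open in the browser once TensorBoard is running.
print(proxy_url('myinstance', 'us-east-1'))
```

The argument list can also be passed to `subprocess.Popen` from a notebook cell instead of a terminal. Either way, the instance needs read access to the S3 location.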
    
