How can I use TensorBoard with AWS SageMaker TensorFlow?


Question

I have started a SageMaker job:

from sagemaker.tensorflow import TensorFlow
mytraining= TensorFlow(entry_point='model.py',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.p2.xlarge',
                        framework_version='2.0.0',
                        py_version='py3',
                        distributions={'parameter_server': {'enabled': False}})

training_data_uri ='s3://path/to/my/data'
mytraining.fit(training_data_uri,run_tensorboard_locally=True)

Using run_tensorboard_locally=True gave me:

Tensorboard is not supported with script mode. You can run the following command: tensorboard --logdir None --host localhost --port 6006 This can be run from anywhere with access to the S3 URI used as the logdir.

It seems I can't use it in script mode, but can I access the TensorBoard logs in S3? And where in S3 are those logs? Here is the relevant part of my training script:

import argparse
import json
import os

def _parse_args():
    parser = argparse.ArgumentParser()

    # Data, model, and output directories
    # model_dir is always passed in from SageMaker. By default this is a S3 path under the default bucket.
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS')))
    parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST'))

    return parser.parse_known_args()

if __name__ == "__main__":
    args, unknown = _parse_args()

    train_data, train_labels = load_training_data(args.train)
    eval_data, eval_labels = load_testing_data(args.train)

    mymodel= model(train_data, train_labels, eval_data, eval_labels)

    if args.current_host == args.hosts[0]:
        mymodel.save(os.path.join(args.sm_model_dir, '000000002/model.h5'))

A similar question is here: stack

EDIT: I tried this new configuration, but it doesn't work.

from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(s3_output_path='s3://PATH/to/my/bucket')

mytraining= TensorFlow(entry_point='model.py',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.p2.xlarge',
                        framework_version='2.0.0',
                        py_version='py3',
                        distributions={'parameter_server': {'enabled':False}},
                        tensorboard_output_config=tensorboard_output_config)

I added the callback to my model.py script; it is the same one I use without SageMaker. As log_dir I set the default directory where TensorBoardOutputConfig writes its data (see the docs), but it doesn't work. I also tried the configuration without the callback.

tensorboardCallback = tf.keras.callbacks.TensorBoard(
    log_dir='/opt/ml/output/tensorboard',
    histogram_freq=0,
    # batch_size=32 is ignored in TF 2.0
    write_graph=True,
    write_grads=False,
    write_images=False,
    embeddings_freq=0,
    embeddings_layer_names=None,
    embeddings_metadata=None,
    embeddings_data=None,
    update_freq='batch')

Recommended answer

It is difficult to pin down the exact root cause in your case, but the following steps worked for me. I started TensorBoard manually inside the notebook instance.

  1. Follow the SageMaker debugging guide to configure the S3 output path for the TensorBoard logs.

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.tensorflow import TensorFlow

tensorboard_output_config = TensorBoardOutputConfig(
       s3_output_path = 's3://bucket-name/tensorboard_log_folder/'
)

estimator = TensorFlow(entry_point='train.py',
               source_dir='./',
               model_dir=model_dir,
               output_path= output_dir,
               train_instance_type=train_instance_type,
               train_instance_count=1,
               hyperparameters=hyperparameters,
               role=sagemaker.get_execution_role(),
               base_job_name='Testing-TrainingJob',
               framework_version='2.2',
               py_version='py37',
               script_mode=True,
               tensorboard_output_config=tensorboard_output_config)

estimator.fit(inputs)

  • Start TensorBoard from a terminal on the notebook instance, using the S3 location configured above.

    $ tensorboard --logdir 's3://bucket-name/tensorboard_log_folder/'
    

  • Access the board via a URL ending in /proxy/6006/. Replace the notebook instance details (instance name and region) in the following URL with your own.

    https://myinstance.notebook.us-east-1.sagemaker.aws/proxy/6006/
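
The last two steps can be sketched as a pair of small helpers. This is only an illustration, not part of the SageMaker SDK: the function names `tensorboard_command` and `notebook_proxy_url` are hypothetical, the URL pattern is copied from the example above, and the `AWS_REGION` prefix rests on the assumption that TensorBoard's S3 reader takes its region from that environment variable. The check for an `s3://` prefix guards against the `--logdir None` situation shown in the question.

```python
import shlex


def tensorboard_command(s3_logdir, port=6006, region=None):
    """Build the shell command that starts TensorBoard against an S3 log folder."""
    # Refuse anything that is not an S3 URI, e.g. the string "None".
    if not s3_logdir.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_logdir!r}")
    command = f"tensorboard --logdir {shlex.quote(s3_logdir)} --port {port}"
    if region:
        # Assumption: the S3 reader used by TensorBoard honours AWS_REGION.
        command = f"AWS_REGION={shlex.quote(region)} {command}"
    return command


def notebook_proxy_url(instance_name, region, port=6006):
    """Build the notebook-instance proxy URL, following the pattern shown above."""
    return f"https://{instance_name}.notebook.{region}.sagemaker.aws/proxy/{port}/"


if __name__ == "__main__":
    # Placeholder bucket, instance name, and region taken from the answer.
    print(tensorboard_command("s3://bucket-name/tensorboard_log_folder/", region="us-east-1"))
    print(notebook_proxy_url("myinstance", "us-east-1"))
```

Run the printed command in the notebook instance terminal, then open the printed URL in a browser that is already signed in to the notebook instance.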
    
