How can I use TensorBoard with AWS SageMaker TensorFlow?
Problem description
I have started a SageMaker job:
from sagemaker.tensorflow import TensorFlow

mytraining = TensorFlow(entry_point='model.py',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.p2.xlarge',
                        framework_version='2.0.0',
                        py_version='py3',
                        distributions={'parameter_server': {'enabled': False}})

training_data_uri = 's3://path/to/my/data'
mytraining.fit(training_data_uri, run_tensorboard_locally=True)
Using run_tensorboard_locally=True gave me:
Tensorboard is not supported with script mode. You can run the following command: tensorboard --logdir None --host localhost --port 6006 This can be run from anywhere with access to the S3 URI used as the logdir.
It seems I can't use it in script mode, but can I access the TensorBoard logs in S3? And where in S3 are the logs?
import argparse
import json
import os


def _parse_args():
    parser = argparse.ArgumentParser()
    # Data, model, and output directories.
    # model_dir is always passed in from SageMaker. By default this is an S3 path under the default bucket.
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS')))
    parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST'))
    return parser.parse_known_args()


if __name__ == "__main__":
    args, unknown = _parse_args()
    train_data, train_labels = load_training_data(args.train)
    eval_data, eval_labels = load_testing_data(args.train)
    mymodel = model(train_data, train_labels, eval_data, eval_labels)
    # Only the first host saves the model.
    if args.current_host == args.hosts[0]:
        mymodel.save(os.path.join(args.sm_model_dir, '000000002/model.h5'))
A similar question was asked here on Stack Overflow.
EDIT: I tried this new config, but it doesn't work.
from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(s3_output_path='s3://PATH/to/my/bucket')

mytraining = TensorFlow(entry_point='model.py',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.p2.xlarge',
                        framework_version='2.0.0',
                        py_version='py3',
                        distributions={'parameter_server': {'enabled': False}},
                        tensorboard_output_config=tensorboard_output_config)
I added the callback to my model.py script; it is the same one I use without SageMaker. As log_dir I set the default directory where, according to the docs, TensorBoardOutputConfig writes its data, but it doesn't work. I also tried it without the callback.
tensorboardCallback = tf.keras.callbacks.TensorBoard(
    log_dir='/opt/ml/output/tensorboard',
    histogram_freq=0,
    # batch_size=32,  # ignored in TF 2.0
    write_graph=True,
    write_grads=False,
    write_images=False,
    embeddings_freq=0,
    embeddings_layer_names=None,
    embeddings_metadata=None,
    embeddings_data=None,
    update_freq='batch')
Recommended answer
It is difficult to debug the exact root cause in your case, but the following steps worked for me. I started TensorBoard manually inside the notebook instance.
- Followed the guide on SageMaker debugging to configure the S3 output path for the TensorBoard logs.
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.tensorflow import TensorFlow

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://bucket-name/tensorboard_log_folder/'
)

estimator = TensorFlow(entry_point='train.py',
                       source_dir='./',
                       model_dir=model_dir,
                       output_path=output_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='Testing-TrainingJob',
                       framework_version='2.2',
                       py_version='py37',
                       script_mode=True,
                       tensorboard_output_config=tensorboard_output_config)

estimator.fit(inputs)
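The estimator configuration above only helps if the training script actually writes TensorBoard event files to a directory that SageMaker uploads. A minimal sketch of that side, assuming the container-local directory is /opt/ml/output/tensorboard (the path used in the question, and to my knowledge the default for TensorBoardOutputConfig). The helper names and the TENSORBOARD_LOG_DIR environment variable are hypothetical and exist only to make the path overridable for local runs:

```python
import os

# Assumed container-local directory that SageMaker syncs to s3_output_path.
DEFAULT_TENSORBOARD_DIR = "/opt/ml/output/tensorboard"


def resolve_log_dir(env=None):
    """Return the TensorBoard log directory.

    TENSORBOARD_LOG_DIR is a hypothetical override, not a SageMaker variable.
    """
    env = os.environ if env is None else env
    return env.get("TENSORBOARD_LOG_DIR", DEFAULT_TENSORBOARD_DIR)


def make_tensorboard_callback(log_dir):
    """Build a Keras TensorBoard callback using only TF 2.x arguments."""
    # Imported lazily so the path helper can be used without TensorFlow installed.
    import tensorflow as tf

    return tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=0,
        write_graph=True,
        write_images=False,
        update_freq="batch",
    )
```

In train.py this would be passed to training as something like callbacks=[make_tensorboard_callback(resolve_log_dir())] in the model.fit call.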
- Started TensorBoard from a terminal on the notebook instance, using the S3 location provided above.
$ tensorboard --logdir 's3://bucket-name/tensorboard_log_folder/'
- Accessed the board via the URL ending in /proxy/6006/. You need to update the notebook instance details in the following URL.
https://myinstance.notebook.us-east-1.sagemaker.aws/proxy/6006/
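The last two steps can be sketched in Python, in case you prefer to drive them from a notebook cell rather than a terminal. This is only an illustration under assumptions: the function names are hypothetical, the URL pattern is copied from the example above, and port 6006 is the one used in that URL. Only the --logdir and --port flags of the tensorboard CLI are used.

```python
import shlex


def tensorboard_command(logdir, port=6006):
    """Return the argument list for launching TensorBoard on a log directory."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


def notebook_proxy_url(instance_name, region, port=6006):
    """Build the notebook proxy URL, following the pattern shown above."""
    return f"https://{instance_name}.notebook.{region}.sagemaker.aws/proxy/{port}/"


if __name__ == "__main__":
    cmd = tensorboard_command("s3://bucket-name/tensorboard_log_folder/")
    # Print the shell command instead of running it; see the note below.
    print(shlex.join(cmd))
    print(notebook_proxy_url("myinstance", "us-east-1"))
```

To actually start the server from Python, the list returned by tensorboard_command could be handed to subprocess.Popen, which keeps TensorBoard running in the background while you open the proxy URL.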