TPU training freezes in the middle of training

Question

I'm trying to train a CNN regression net in TF 1.12 on a TPU v3-8 instance running version 1.12. The model compiles successfully with XLA and training starts, but somewhere past the halfway point of the first epoch it freezes and does nothing further. I cannot find the root of the problem.

import tensorflow as tf  # TF 1.12 API

def read_tfrecord(example):
    # Each record holds a JPEG-encoded image and the labels as raw float64 bytes.
    features = {
        'image': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.string)
    }
    sample = tf.parse_single_example(example, features)
    image = tf.image.decode_jpeg(sample['image'], channels=3)
    # Fix the static shape so that XLA sees fully defined dimensions.
    image = tf.reshape(image, tf.stack([540, 540, 3]))
    image = augmentation(image)
    labels = tf.decode_raw(sample['labels'], tf.float64)
    labels = tf.reshape(labels, tf.stack([2, 2, 45]))
    # Cast to float32 to match the image dtype.
    labels = tf.cast(labels, tf.float32)
    return image, labels

def load_dataset(filenames):
    files = tf.data.Dataset.list_files(filenames)
    dataset = files.apply(tf.data.experimental.parallel_interleave(tf.data.TFRecordDataset, cycle_length=4))
    dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=read_tfrecord, batch_size=BATCH_SIZE, drop_remainder=True))
    dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(1024, -1))
    dataset = dataset.prefetch(buffer_size=1024)
    return dataset

def augmentation(img):
    # Scale pixel values to [0, 1] before applying the random colour ops.
    image = tf.cast(img, tf.float32) / 255.0
    # Float literals keep max_delta non-zero under Python 2 integer division.
    image = tf.image.random_brightness(image, max_delta=25.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    image = tf.image.per_image_standardization(image)
    return image

def get_batched_dataset(filenames):
    dataset = load_dataset(filenames)
    return dataset


def get_training_dataset():
    return get_batched_dataset(training_filenames)

def get_validation_dataset():
    return get_batched_dataset(validation_filenames)

Recommended answer

The most likely cause is an issue in the data pre-processing function. Take a look at the "Errors in the middle of training" section of the troubleshooting documentation; it may offer some guidance.
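One way to test the pre-processing hypothesis is to run the input pipeline on its own, without the model, and time how long each batch takes to arrive. The following is a minimal sketch rather than an official diagnostic: `time_batches` and its parameters are hypothetical names introduced here, and it accepts any Python iterable so that it stays independent of the TensorFlow version.

```python
import time


def time_batches(batches, max_steps, slow_threshold_s=5.0):
    """Pull up to max_steps items from `batches`, timing each pull.

    Returns a list of (step, seconds) pairs for every pull that took
    longer than slow_threshold_s. `batches` can be any iterable.
    """
    slow_steps = []
    iterator = iter(batches)
    for step in range(max_steps):
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            break  # the input ran out before max_steps was reached
        elapsed = time.perf_counter() - start
        if elapsed > slow_threshold_s:
            slow_steps.append((step, elapsed))
    return slow_steps
```

In TF 1.12 the iterable could be a small generator that repeatedly calls `sess.run(next_element)`, where `next_element` comes from `get_training_dataset().make_one_shot_iterator().get_next()`. If the pulls slow down sharply or hang at about the same step where training freezes, the problem is probably in the input pipeline and not in the model.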

I did not spot anything strange in your code.
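That said, one thing that may be worth ruling out is the host memory taken by the buffers. In the posted pipeline, `shuffle_and_repeat(1024, -1)` and `prefetch(buffer_size=1024)` both come after `map_and_batch`, so every buffer slot holds a whole batch of decoded float32 images, not a single example. The sketch below is only back-of-the-envelope arithmetic and not a confirmed diagnosis. `buffered_bytes` is a hypothetical helper, and because `BATCH_SIZE` is not shown in the question, any batch size passed in is an assumption.

```python
def buffered_bytes(batch_size, buffered_batches,
                   image_shape=(540, 540, 3),
                   label_shape=(2, 2, 45),
                   bytes_per_value=4):
    """Rough number of bytes held by a buffer of whole batches.

    The shapes match the reshape calls in the question; bytes_per_value=4
    corresponds to float32. Framework overhead is ignored.
    """
    image_values = 1
    for dim in image_shape:
        image_values *= dim
    label_values = 1
    for dim in label_shape:
        label_values *= dim
    per_example = (image_values + label_values) * bytes_per_value
    return per_example * batch_size * buffered_batches


# Example with an assumed batch size of 8 and a 1024-batch buffer.
gib = buffered_bytes(batch_size=8, buffered_batches=1024) / float(2 ** 30)
print("approx. %.1f GiB per full buffer" % gib)
```

If the estimate is close to or above the memory available on the host VM, shrinking the prefetch buffer or shuffling before batching would be a cheap experiment.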

Are you using Cloud Storage buckets to store those images and files? If so, are the buckets in the same region as your TPU?
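The region check can be sketched as a simple string comparison. This assumes that a Compute Engine zone name is its region name plus a one-letter suffix (for example `us-central1-b` lies in `us-central1`) and that the bucket location is reported as a region string such as `US-CENTRAL1`. Both helper names are hypothetical, and looking up the actual zone and bucket location is left to the console or CLI.

```python
def zone_to_region(zone):
    """Drop the trailing zone letter: 'us-central1-b' -> 'us-central1'."""
    region, _, _suffix = zone.rpartition("-")
    return region


def same_region(tpu_zone, bucket_location):
    """Compare the TPU's region with the bucket location, ignoring case."""
    return zone_to_region(tpu_zone).lower() == bucket_location.lower()


# Example inputs only; substitute the real zone and bucket location.
print(same_region("us-central1-b", "US-CENTRAL1"))
```

A multi-region bucket location such as `US` will not match under this comparison, so that case needs a manual look.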

You can use Cloud TPU Audit Logs to determine whether the issue is related to the resources in the system or to how you are accessing your data.

Finally, I suggest taking a look at the Training Mask RCNN on Cloud TPU documentation.
