TPU training freezes in the middle of training

Published: 2021/4/22 19:36:00  Tags: neural-network cloud google-compute-engine tpu


Problem description

I'm trying to train a CNN regression net in TF 1.12 on a TPU v3-8 instance (TPU software version 1.12). The model compiles successfully with XLA and training starts, but somewhere past the halfway point of the first epoch it freezes and does nothing. I cannot find the root of the problem.

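import tensorflow as tf  # TF 1.12 API

# BATCH_SIZE, training_filenames and validation_filenames are not shown in the question.

def read_tfrecord(example):
    # Each record holds a JPEG-encoded image and the raw bytes of a float64 label tensor.
    features = {
        'image': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.string)
    }
    sample = tf.parse_single_example(example, features)
    image = tf.image.decode_jpeg(sample['image'], channels=3)
    image = tf.reshape(image, [540, 540, 3])  # fixed shape; TPU/XLA needs static shapes
    image = augmentation(image)               # returns float32
    labels = tf.decode_raw(sample['labels'], tf.float64)
    labels = tf.reshape(labels, [2, 2, 45])
    labels = tf.cast(labels, tf.float32)
    return image, labels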

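def load_dataset(filenames):
    files = tf.data.Dataset.list_files(filenames)
    # Read 4 TFRecord files concurrently.
    dataset = files.apply(tf.data.experimental.parallel_interleave(tf.data.TFRecordDataset, cycle_length=4))
    # Parse, augment and batch; drop_remainder=True keeps the batch dimension static.
    dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=read_tfrecord, batch_size=BATCH_SIZE, drop_remainder=True))
    # Applied after batching, so this buffer holds 1024 whole batches, not 1024 examples.
    dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(1024, -1))
    # Likewise, up to 1024 whole batches are prefetched.
    dataset = dataset.prefetch(buffer_size=1024)
    return dataset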

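def augmentation(img):
    # Scale uint8 pixels to [0, 1] before applying the random colour augmentations.
    image = tf.cast(img, tf.float32) / 255.0
    # Float literals, so the delta is not truncated to 0 by Python 2 integer division.
    image = tf.image.random_brightness(image, max_delta=25.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    image = tf.image.per_image_standardization(image)
    return image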

def get_batched_dataset(filenames):
    dataset = load_dataset(filenames)
    return dataset


def get_training_dataset():
    return get_batched_dataset(training_filenames)

def get_validation_dataset():
    return get_batched_dataset(validation_filenames)
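
One thing that may be worth checking, offered as a possibility and not a confirmed diagnosis: in load_dataset() both shuffle_and_repeat(1024, -1) and prefetch(buffer_size=1024) come after map_and_batch, so each buffer slot holds a whole batch of decoded float32 images. The rough sketch below estimates how much host memory those two buffers could hold when full. The shapes are taken from read_tfrecord(); BATCH_SIZE is not given in the question, so hypothetical_batch_size is an assumed value for illustration only, and the helper names are made up for this sketch.

```python
# Rough host-memory estimate for the tf.data buffers in load_dataset().
# Shapes come from read_tfrecord(); the batch size is NOT given in the
# question, so the value used at the bottom is purely hypothetical.

BYTES_PER_FLOAT32 = 4
IMAGE_SHAPE = (540, 540, 3)  # image shape after tf.reshape in read_tfrecord
LABEL_SHAPE = (2, 2, 45)     # label shape after tf.reshape in read_tfrecord
SHUFFLE_BUFFER = 1024        # shuffle_and_repeat(1024, -1), counted in batches
PREFETCH_BUFFER = 1024       # prefetch(buffer_size=1024), counted in batches


def num_elements(shape):
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def bytes_per_example():
    """Bytes for one (image, labels) pair once both are float32."""
    return BYTES_PER_FLOAT32 * (num_elements(IMAGE_SHAPE) + num_elements(LABEL_SHAPE))


def buffered_bytes(batch_size):
    """Upper bound on bytes held when both buffers are completely full."""
    per_batch = bytes_per_example() * batch_size
    return per_batch * (SHUFFLE_BUFFER + PREFETCH_BUFFER)


if __name__ == "__main__":
    hypothetical_batch_size = 128  # assumption, not from the question
    gib = buffered_bytes(hypothetical_batch_size) / float(2 ** 30)
    print("per example: %d bytes" % bytes_per_example())
    print("both buffers full: %.1f GiB" % gib)
```

Each example is about 3.5 MB, so the two full buffers come to roughly 6.7 GiB per unit of batch size. If that exceeds the memory of the machine running the input pipeline, shuffling before batching and prefetching only a few batches would be one experiment to try.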
