TPU training freezes in the middle of training
Problem description
I'm trying to train a CNN regression net in TF 1.12 on a TPU v3-8 instance running version 1.12. The model compiles successfully with XLA and training starts, but somewhere after half of the iterations of the first epoch it freezes and does nothing. I cannot find the root of the problem.
import tensorflow as tf

# BATCH_SIZE, training_filenames and validation_filenames are defined elsewhere in the script.

def read_tfrecord(example):
    # Parse one serialized example into a float32 image and a float32 label tensor.
    features = {
        'image': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.string)
    }
    sample = tf.parse_single_example(example, features)
    image = tf.image.decode_jpeg(sample['image'], channels=3)
    image = tf.reshape(image, tf.stack([540, 540, 3]))
    image = augmentation(image)
    # Labels are stored as raw float64 bytes and cast to float32 after reshaping.
    labels = tf.decode_raw(sample['labels'], tf.float64)
    labels = tf.reshape(labels, tf.stack([2, 2, 45]))
    labels = tf.cast(labels, tf.float32)
    return image, labels

def load_dataset(filenames):
    files = tf.data.Dataset.list_files(filenames)
    dataset = files.apply(tf.data.experimental.parallel_interleave(tf.data.TFRecordDataset, cycle_length=4))
    dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=read_tfrecord, batch_size=BATCH_SIZE, drop_remainder=True))
    # Shuffling and prefetching happen after batching, so both buffers hold whole batches.
    dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(1024, -1))
    dataset = dataset.prefetch(buffer_size=1024)
    return dataset

def augmentation(img):
    # Scale to [0, 1], apply random color jitter, then standardize per image.
    image = tf.cast(img, tf.float32) / 255.0
    # Float literals keep the division from truncating to 0 under Python 2.
    image = tf.image.random_brightness(image, max_delta=25.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    image = tf.image.per_image_standardization(image)
    return image

def get_batched_dataset(filenames):
    dataset = load_dataset(filenames)
    return dataset

def get_training_dataset():
    return get_batched_dataset(training_filenames)

def get_validation_dataset():
    return get_batched_dataset(validation_filenames)
Recommended answer
The most likely cause is an issue in the data pre-processing function. Take a look at the "Errors in the middle of training" section of the troubleshooting documentation; it may give you some guidance.
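As a rough illustration of the kind of input-pipeline check this points to, the sketch below estimates how much host memory the shuffle and prefetch buffers in the question's pipeline could hold. The image shape, label shape, and buffer sizes are taken from the question. The batch size is not given there, so `ASSUMED_BATCH_SIZE` is a hypothetical value, and the helper names are made up for this example. It is only back-of-the-envelope arithmetic under the assumption that both buffers fill completely, not a diagnosis.

```python
# Rough host-memory estimate for the buffers in the question's input pipeline.
# Shapes and buffer sizes come from the question; the batch size is an assumption.

IMAGE_SHAPE = (540, 540, 3)   # image shape after tf.reshape in read_tfrecord
LABEL_SHAPE = (2, 2, 45)      # label shape after tf.reshape in read_tfrecord
BYTES_PER_FLOAT32 = 4         # both tensors are float32 when they leave read_tfrecord
SHUFFLE_BUFFER = 1024         # shuffle_and_repeat(1024, -1), applied after batching
PREFETCH_BUFFER = 1024        # prefetch(buffer_size=1024), also applied after batching
ASSUMED_BATCH_SIZE = 128      # hypothetical: BATCH_SIZE is not shown in the question


def num_elements(shape):
    """Number of scalar values in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def bytes_per_example():
    """Bytes for one (image, labels) pair stored as float32."""
    return (num_elements(IMAGE_SHAPE) + num_elements(LABEL_SHAPE)) * BYTES_PER_FLOAT32


def buffer_bytes(batch_size, buffered_batches):
    """Bytes held by a buffer that stores `buffered_batches` whole batches."""
    return bytes_per_example() * batch_size * buffered_batches


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    shuffle = buffer_bytes(ASSUMED_BATCH_SIZE, SHUFFLE_BUFFER)
    prefetch = buffer_bytes(ASSUMED_BATCH_SIZE, PREFETCH_BUFFER)
    print("per example: %d bytes" % bytes_per_example())
    print("shuffle buffer if full:  %.1f GiB" % to_gib(shuffle))
    print("prefetch buffer if full: %.1f GiB" % to_gib(prefetch))
```

Because both buffers sit after `map_and_batch`, they count batches rather than single images, so the estimate grows linearly with the batch size. If the numbers come out larger than the memory of the machine running the input pipeline, smaller buffers, or shuffling before batching, might be worth trying.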
I did not catch anything strange in your code.
Are you using Cloud Storage buckets for those images and files? If so, are the buckets in the same region?
You can use Cloud TPU Audit Logs to determine whether the issue is related to the resources in the system or to how you are accessing your data.
Finally, I suggest taking a look at the "Training Mask RCNN on Cloud TPU" documentation.