Memory leak in TensorFlow upon starting new epoch

Question
I'm working on a training script in TensorFlow to classify two different types of images. Here is the class that creates a data set object, which is used to generate batches and increment epochs. It works fine until the first epoch is completed. It then fails at the line self._images = self._images[perm] within the next_batch method. This doesn't make sense to me, since Python shouldn't be duplicating self._images, only reshuffling the data.
import numpy as np

class DataSet(object):
    def __init__(self, images, labels, norm=True):
        assert images.shape[0] == labels.shape[0], (
            "images.shape: %s labels.shape: %s" % (images.shape, labels.shape))
        self._num_examples = images.shape[0]
        self._images = images
        self._labels = labels
        self._epochs_completed = 0
        self._index_in_epoch = 0
        self._norm = norm
        # Shuffle the data right away
        perm = np.arange(self._num_examples)
        np.random.shuffle(perm)
        self._images = self._images[perm]
        self._labels = self._labels[perm]

    @property
    def images(self):
        return self._images

    @property
    def labels(self):
        return self._labels

    @property
    def num_examples(self):
        return self._num_examples

    @property
    def epochs_completed(self):
        return self._epochs_completed

    def next_batch(self, batch_size):
        """Return the next `batch_size` examples from this data set."""
        start = self._index_in_epoch
        self._index_in_epoch += batch_size
        if self._index_in_epoch > self._num_examples:
            # Finished epoch
            self._epochs_completed += 1
            print("Completed epoch %d.\n" % self._epochs_completed)
            # Shuffle the data
            perm = np.arange(self._num_examples)
            np.random.shuffle(perm)
            self._images = self._images[perm]  # this is where OOM happens
            self._labels = self._labels[perm]
            # Start next epoch
            start = 0
            self._index_in_epoch = batch_size
            assert batch_size <= self._num_examples
        end = self._index_in_epoch
        return self._images[start:end], self._labels[start:end]
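One detail worth noting about the failing line: NumPy fancy indexing such as images[perm] always allocates a brand-new array rather than returning a view, so the old array and its shuffled copy are both alive during the reshuffle, briefly doubling memory. A minimal sketch (array shapes here are illustrative, not from the question):

```python
import numpy as np

images = np.zeros((1000, 64, 64), dtype=np.float32)
perm = np.random.permutation(images.shape[0])

shuffled = images[perm]       # fancy indexing: allocates a NEW array
print(shuffled.base is None)  # True -> a full copy, not a view

# One way to avoid the transient double allocation: shuffle an index
# array instead, and slice batches through it, so only batch-sized
# chunks are ever copied.
order = np.random.permutation(images.shape[0])
batch = images[order[:32]]    # only 32 rows are copied here
```

With this pattern, next_batch would reshuffle order at each epoch boundary instead of reordering the full image array.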
Memory usage does not increase during ordinary training cycles. Here is the relevant portion of the training code; data_train_norm is a DataSet object.
import datetime

batch_size = 300
csv_plot = open("csvs/train_plot.csv", "a")
for i in range(3000):
    batch = data_train_norm.next_batch(batch_size)
    if i % 50 == 0:
        tce = cross_entropy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0},
                                 session=sess)
        print("\nstep %d, train ce %g" % (i, tce))
        print(datetime.datetime.now())
        csv_plot.write("%d, %g\n" % (i, tce))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.8}, session=sess)
version = 1
saver.save(sess, 'nets/cnn0nu_batch_gpu_roi_v%02d' % version)
csv_plot.close()
Answer
Are you using dataset = dataset.shuffle(buffer_size)? If so, try reducing buffer_size. That worked for me.
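For context on why buffer_size matters: tf.data's Dataset.shuffle keeps up to buffer_size elements materialized in memory at once, so a large buffer over large images can exhaust RAM. A rough pure-Python sketch of that mechanism (the function name and seeding are illustrative, not TensorFlow API):

```python
import random

def shuffled(iterable, buffer_size, rng=random.Random(0)):
    """Yield items in approximately shuffled order while holding
    at most `buffer_size` items in memory at a time."""
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) >= buffer_size:
            # emit a random element from the buffer
            yield buf.pop(rng.randrange(len(buf)))
    # drain whatever remains at the end of the input
    rng.shuffle(buf)
    while buf:
        yield buf.pop()

out = list(shuffled(range(10), buffer_size=4))
print(sorted(out) == list(range(10)))  # every element appears exactly once
```

A smaller buffer trades shuffle quality for memory: elements can only move within roughly buffer_size positions of their original order.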