Very low GPU usage during training in Tensorflow

Problem description

    I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is a part of the assignment for the Udacity Deep-Learning course. To be more precise, the task is to classify letters rendered from various fonts (the dataset is called notMNIST).

    The code I ended up with looks fairly simple, but no matter what I always get very low GPU usage during training. I measure load with GPU-Z and it shows just 25-30%.

    Here is my current code:

    import tensorflow as tf
    from tensorflow.contrib.data import Dataset  # TF 1.x contrib input-pipeline API used below

    graph = tf.Graph()
    with graph.as_default():
        tf.set_random_seed(52)
    
        # dataset definition
        dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
        dataset = dataset.shuffle(buffer_size=20000)
        dataset = dataset.batch(128)
        iterator = dataset.make_initializable_iterator()
        sample = iterator.get_next()
        x = sample['x']
        y = sample['y']
    
        # actual computation graph
        keep_prob = tf.placeholder(tf.float32)
        is_training = tf.placeholder(tf.bool, name='is_training')
    
        fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
        fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
        fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
        logits = dense(fc3, NUM_CLASSES, 'logits')
    
        with tf.name_scope('accuracy'):
            accuracy = tf.reduce_mean(
                tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
            )
            accuracy_percent = 100 * accuracy
    
        with tf.name_scope('loss'):
            loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
    
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            # ensures that we execute the update_ops before performing the train_op
            # needed for batch normalization (apparently)
            train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)
    
    with tf.Session(graph=graph) as sess:
        tf.global_variables_initializer().run()
        step = 0
        epoch = 0
        while True:
            sess.run(iterator.initializer, feed_dict={})
            while True:
                step += 1
                try:
                    sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
                except tf.errors.OutOfRangeError:
                    logger.info('End of epoch #%d', epoch)
                    break
    
            # end of epoch
            train_l, train_ac = sess.run(
                [loss, accuracy_percent],
                feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
            )
            test_l, test_ac = sess.run(
                [loss, accuracy_percent],
                feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
            )
            logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
            logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)
    
            epoch += 1
    

    Here's what I tried so far:

    1. I changed the input pipeline from a simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understood, it is supposed to take care of the efficiency of the input, e.g. loading data in a separate thread, so there should not be any bottleneck associated with the input (see the first sketch after this list).

    2. I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 (see the second sketch after this list). However, these traces didn't really show anything interesting: more than 90% of the train step is matmul operations.

    3. Changed the batch size. When I change it from 128 to 512 the load increases from ~30% to ~38%; when I increase it further to 2048, the load goes to ~45%. I have 6 GB of GPU memory and the dataset consists of single-channel 28x28 images. Am I really supposed to use such a big batch size? Should I increase it further?
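
    Regarding points 1 and 3, here is a minimal sketch of the same input pipeline with prefetching enabled and the batch size pulled out into a parameter. This assumes TF 1.4+, where the pipeline lives under tf.data and Dataset.prefetch is available; the value 512 is purely illustrative and not from the original post:

    import tensorflow as tf

    BATCH_SIZE = 512  # illustrative value; larger batches tend to raise GPU utilization

    dataset = tf.data.Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(BATCH_SIZE)
    # prepare the next batch on the CPU while the GPU runs the current step
    dataset = dataset.prefetch(1)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()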
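
    And for point 2, this is roughly how a Chrome trace of a single training step can be collected with the TF 1.x timeline utilities, along the lines of the linked issue; it reuses the session and placeholders defined above and is only a sketch:

    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(train_op,
             feed_dict={keep_prob: 0.5, is_training: True},
             options=run_options,
             run_metadata=run_metadata)
    # write a trace that can be opened in chrome://tracing to inspect per-op timing
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())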

    Generally, should I worry about the low load? Is it really a sign that I am training inefficiently?

    Here are the GPU-Z screenshots with 128 images in the batch. You can see low load, with occasional spikes to 100% when I measure accuracy on the entire dataset after each epoch.

    Solution

    MNIST-size networks are tiny and it's hard to achieve high GPU (or CPU) efficiency for them; I think 30% is not unusual for your application. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models like yours, statistical efficiency drops off very quickly after a batch size of about 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
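
    To put numbers on the computational side of that trade-off, one can simply time a fixed number of training steps at each candidate batch size and compare examples per second. A minimal sketch using the session, iterator, and placeholders from the question (the helper name and step counts are made up for illustration; keep warmup + steps below one epoch for the chosen batch size):

    import time

    def examples_per_second(sess, train_op, iterator, batch_size, feed, warmup=5, steps=50):
        """Time `steps` training steps and return the measured throughput."""
        sess.run(iterator.initializer)
        for _ in range(warmup):      # skip the first steps (graph warm-up, caches)
            sess.run(train_op, feed_dict=feed)
        start = time.time()
        for _ in range(steps):
            sess.run(train_op, feed_dict=feed)
        return steps * batch_size / (time.time() - start)

    # e.g. rebuild the graph with batch sizes 128, 512, 2048, call
    # examples_per_second(sess, train_op, iterator, bs, {keep_prob: 0.5, is_training: True})
    # for each, and weigh the throughput gain against how many epochs each batch size
    # needs to reach the target accuracy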
