TensorFlow: same code but different result from CPU device to GPU device


Problem description

I am trying to implement a program to test the Tensorflow performance on GPU device. Data test is MNIST data, supervised training using Multilayer perceptron(Neural networks). I followed this simple example but I change the number of performance batch gradient to 10000

for i in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    if i % 500 == 0:
        print(i)

Eventually, when I check the predict accuracy using this code

correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction,"float"))
print(sess.run(accuracy,feed_dict={x:mnist.test.images,y_:mnist.test.labels}))
print(tf.convert_to_tensor(mnist.test.images).get_shape())
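The accuracy computation itself is easy to sanity-check outside of TensorFlow. Here is a minimal NumPy sketch of the same argmax-and-mean logic, using made-up toy predictions (the values below are illustrative, not from the MNIST run):

```python
import numpy as np

# Toy softmax outputs and one-hot labels (illustrative values only).
y = np.array([[0.1, 0.8, 0.1],    # predicted class 1
              [0.6, 0.3, 0.1],    # predicted class 0
              [0.2, 0.2, 0.6]])   # predicted class 2
y_ = np.array([[0, 1, 0],         # true class 1
               [0, 0, 1],         # true class 2
               [0, 0, 1]])        # true class 2

# Mirrors tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
# followed by the cast-to-float and reduce_mean.
correct_prediction = np.argmax(y, axis=1) == np.argmax(y_, axis=1)
accuracy = correct_prediction.astype("float32").mean()
print(accuracy)  # 2 of 3 predictions match
```

If this computation gives sensible numbers on toy data, the accuracy formula is not the source of the CPU/GPU gap.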

It turns out that the accuracy differs between CPU and GPU: the GPU run returns an accuracy of approximately 0.9xx, while the CPU run returns only 0.3xx. Does anyone know the reason, or why this issue can happen?

Recommended answer

There are two primary reasons for this kind of behavior (besides bugs).

Numerical stability

It turns out that adding numbers is not entirely as easy as it might seem. Let's say I want to add a trillion 2's together. The correct answer is two trillion. But if you add them together in floating point on a machine with a word size of only, say, 32 bits, after a while your answer will get stuck at a smaller value. The reason is that, after a while, the 2's you're adding are below the smallest bit of the mantissa of the floating-point sum.
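This effect is easy to reproduce directly. The sketch below (a toy demonstration, not the MNIST code) fast-forwards a float32 running sum to just below its saturation point: float32 carries a 24-bit significand, so a sum of repeated 2.0's stops growing at 2 * 2**24 = 33,554,432.

```python
import numpy as np

step = np.float32(2.0)
# Start the running sum just below the point where adding 2.0 stops working.
# Once the sum reaches 2 * 2**24 (= 33,554,432), the spacing between adjacent
# float32 values is 4.0, so each further +2.0 is a half-ulp tie that
# round-to-even discards.
total = np.float32(2.0 * 2**24 - 2**10)

for _ in range(5000):
    total += step

# Exact arithmetic would give 33,563,408; float32 gets stuck instead.
print(total)                  # 33554432.0
print(total + step == total)  # True: further additions change nothing

# The same running sum in float64 still resolves each +2.0 at this scale.
total64 = np.float64(2.0 * 2**24 - 2**10)
for _ in range(5000):
    total64 += 2.0
print(total64)                # 33563408.0
```

The same kind of rounding happens inside large reductions, and the order of operations (which differs between CPU and GPU kernels) changes exactly where it bites.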

These kinds of issues abound in numerical computing, and this particular discrepancy is known in TensorFlow (1,2, to name a few). It's possible that you're seeing an effect of this.

Initial conditions

Training a neural net is a stochastic process, and as such it depends on your initial conditions. Sometimes, especially if your hyperparameters are not tuned very well, your net will get stuck near a poor local minimum, and you'll end up with mediocre behavior. Adjusting your optimizer parameters (or better, using an adaptive method like Adam) might help here.
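The initial-conditions point can be illustrated with a toy example (a sketch, not the MNIST setup): plain gradient descent on a one-dimensional non-convex function ends up in different minima depending purely on where it starts.

```python
# f(x) = x**4 - 4*x**2 + x is non-convex, with a deep local minimum
# near x ~ -1.47 and a shallower (poorer) one near x ~ 1.35.
def f(x):
    return x**4 - 4 * x**2 + x

def grad(x):
    return 4 * x**3 - 8 * x + 1

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent; the end point depends on the start point.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # converges to the deeper minimum
right = descend(2.0)   # converges to the shallower minimum
print(round(left, 3), round(f(left), 3))
print(round(right, 3), round(f(right), 3))
# The two runs settle at different points with different loss values.
```

A real network has the same issue in millions of dimensions, which is why two runs that differ only in initialization (or in the device's numerics) can land at very different accuracies.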

Of course, with all that said, this is a fairly large difference, so I'd double check your results before blaming it on the underlying math package or bad luck.
