Float16 slower than float32 in Keras

Question

I'm testing out my new NVIDIA Titan V, which supports float16 operations. I noticed that during training, float16 is much slower (~800 ms/step) than float32 (~500 ms/step).

To do float16 operations, I changed my keras.json file to:

{
    "backend": "tensorflow",
    "floatx": "float16",
    "image_data_format": "channels_last",
    "epsilon": 1e-07
}

Why are the float16 operations so much slower? Do I need to make modifications to my code and not just the keras.json file?
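
For reference, the floatx default can also be set programmatically at the top of a script instead of (or in addition to) editing keras.json. A minimal sketch, assuming Keras 2.x with the TensorFlow backend:

from keras import backend as K

K.set_floatx('float16')
K.set_epsilon(1e-4)  # the default 1e-07 sits at the edge of float16's range, so a larger epsilon is usually chosen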

I am using CUDA 9.0, cuDNN 7.0, tensorflow 1.7.0, and keras 2.1.5 on Windows 10. My python 3.5 code is below:

import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D, Activation, Flatten, Dense

img_width, img_height = 336, 224

train_data_dir = 'C:\\my_dir\\train'
test_data_dir = 'C:\\my_dir\\test'
batch_size=128

datagen = ImageDataGenerator(rescale=1./255,
    horizontal_flip=True,   # randomly flip the images 
    vertical_flip=True) 

train_generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

test_generator = datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

# Architecture of NN
model = Sequential()
model.add(Conv2D(32,(3, 3), input_shape=(img_height, img_width, 3),padding='same',kernel_initializer='lecun_normal'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32,(3, 3),padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64,(3, 3),padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64,(3, 3),padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(AveragePooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('sigmoid'))

my_rmsprop = keras.optimizers.RMSprop(lr=0.0001, rho=0.9, epsilon=1e-04, decay=0.0)
model.compile(loss='binary_crossentropy',
          optimizer=my_rmsprop,
          metrics=['accuracy'])

# Training 
nb_epoch = 32
nb_train_samples = 512
nb_test_samples = 512

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples/batch_size,
    epochs=nb_epoch,
    verbose=1,
    validation_data=test_generator,
    validation_steps=nb_test_samples/batch_size)

# Evaluating on the testing set
model.evaluate_generator(test_generator, nb_test_samples)

Answer

I updated to CUDA 10.0, cuDNN 7.4.1, tensorflow 1.13.1, keras 2.2.4, and python 3.7.3. Using the same code as in the OP, training time was marginally faster with float16 over float32.

I fully expect that a more complex network architecture would show a bigger difference in performance, but I didn't test this.
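
Not part of the original answer, but a rough micro-benchmark of this comparison can be sketched as follows: rebuild the same small convnet under each floatx setting and time a few training steps on random data. It assumes Keras 2.x on the TensorFlow backend; the layer stack, shapes, and step counts are illustrative only.

import time
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

def time_training(dtype, steps=10, batch_size=128):
    K.clear_session()           # drop any graph built under the previous dtype
    K.set_floatx(dtype)
    if dtype == 'float16':
        K.set_epsilon(1e-4)     # 1e-07 is too small to be useful in float16
    model = Sequential([
        Conv2D(32, (3, 3), padding='same', input_shape=(224, 336, 3)),
        Activation('relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(1),
        Activation('sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='rmsprop')
    x = np.random.rand(batch_size, 224, 336, 3).astype(dtype)
    y = np.random.randint(0, 2, size=(batch_size, 1)).astype(dtype)
    model.train_on_batch(x, y)  # warm-up step (graph construction, cuDNN autotune)
    start = time.time()
    for _ in range(steps):
        model.train_on_batch(x, y)
    return (time.time() - start) / steps

for dtype in ('float32', 'float16'):
    print('%s: %.3f s/step' % (dtype, time_training(dtype)))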
