tensorflow gpu is only running on CPU


Problem description

I installed Anaconda Navigator on Windows 10 along with all the necessary Nvidia/CUDA packages, created a new environment called tensorflow-gpu-env, updated the PATH information, etc. When I run a model (built with tensorflow.keras), I see that CPU utilization increases significantly, GPU utilization stays at 0%, and the model simply does not train.

I ran a couple of tests to see how things look:

print(tf.test.is_built_with_cuda())
True

The above output (True) looks correct.
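Note, though, that is_built_with_cuda() only says the installed binary was compiled with CUDA support; it does not prove a GPU is usable at runtime. A minimal sketch of an additional runtime check available in TF 1.x:

```python
import tensorflow as tf

# is_built_with_cuda() confirms the binary was compiled with CUDA support,
# but not that a GPU is actually visible to the runtime.
print(tf.test.is_built_with_cuda())

# gpu_device_name() returns e.g. "/device:GPU:0" when a GPU is usable,
# or an empty string when TensorFlow would fall back to CPU only.
gpu_name = tf.test.gpu_device_name()
print(gpu_name if gpu_name else "No GPU device found")
```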

Another test:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

Output:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 1634313269296444741
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 1478485606
locality {
  bus_id: 1
  links {
  }
}
incarnation: 16493618810057409699
physical_device_desc: "device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0"
]

So far so good... Later in my code, I start the training with the following:

history = merged_model.fit_generator(generator=train_generator,
                                     epochs=60,
                                     verbose=2,
                                     callbacks=[reduce_lr_on_plateau],
                                     validation_data=val_generator,
                                     use_multiprocessing=True,
                                     max_queue_size=50,
                                     workers=3)

I also tried to run the training as follows:

with tf.device('/gpu:0'):
    history = merged_model.fit_generator(generator=train_generator,
                                         epochs=60,
                                         verbose=2,
                                         callbacks=[reduce_lr_on_plateau],
                                         validation_data=val_generator,
                                         use_multiprocessing=True,
                                         max_queue_size=50,
                                         workers=3)

No matter how I start it, training never begins; I keep seeing increased CPU utilization with 0% GPU utilization.

Why is my tensorflow-gpu installation only using the CPU? I have spent hours with literally no progress.

ADDENDUM

When I run conda list on the console, I see the following regarding tensorflow:

tensorflow-base           1.11.0          gpu_py36h6e53903_0
tensorflow-gpu            1.11.0                    <pip>

What is this tensorflow-base? Can it cause a problem? Before installing tensorflow-gpu, I made sure I uninstalled both tensorflow and tensorflow-gpu using conda as well as pip, and then installed tensorflow-gpu with pip. I am not sure whether this tensorflow-base came with my tensorflow-gpu installation.

ADDENDUM 2 It looks like tensorflow-base was a conda package, because I could uninstall it with conda uninstall tensorflow-base. I still have tensorflow-gpu installed, but I now cannot import tensorflow anymore; it says "No module named tensorflow". It seems my conda environment no longer sees my tensorflow-gpu installation. I am quite confused at the moment.
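A common way to recover from a state like this is to reinstall the pip package inside the activated environment, so it lands in that environment's site-packages. A sketch, assuming the environment name from the question (the pinned version is illustrative, matching the conda list output):

```shell
# Activate the environment first, so pip installs into it
# rather than into the base environment.
conda activate tensorflow-gpu-env
pip install --upgrade tensorflow-gpu==1.11.0

# Sanity check: the import should now succeed.
python -c "import tensorflow as tf; print(tf.__version__)"
```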

Answer

@Smokrow, thank you for your answers above. It appears that Keras has problems with multiprocessing on Windows.

history = merged_model.fit_generator(generator=train_generator,
                                     epochs=60,
                                     verbose=2,
                                     callbacks=[reduce_lr_on_plateau],
                                     validation_data=val_generator,
                                     use_multiprocessing=True,
                                     max_queue_size=50,
                                     workers=3)

The piece of code above causes Keras to hang, and literally no progress is seen. If you are running your code on Windows, use_multiprocessing needs to be set to False! Otherwise it does not work. Interestingly, workers can still be set to a number greater than one, and it still gives performance benefits. I have difficulty understanding what is really happening in the background, but it does improve performance. So the following piece of code made it work.

history = merged_model.fit_generator(generator=train_generator,
                                     epochs=60,
                                     verbose=2,
                                     callbacks=[reduce_lr_on_plateau],
                                     validation_data=val_generator,
                                     use_multiprocessing=False,  # CHANGED
                                     max_queue_size=50,
                                     workers=3)
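A plausible explanation for why workers > 1 still helps with use_multiprocessing=False is that Keras then fills its queue with threads instead of processes, sidestepping Windows' lack of fork. Thread workers pair well with an index-based keras.utils.Sequence; a hedged sketch (the class and data here are illustrative, not from the question):

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class BatchSequence(Sequence):
    """Index-based batch loader; safe to use with thread workers."""

    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller).
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]

# Used in place of a plain generator, e.g.:
# merged_model.fit_generator(BatchSequence(x_train, y_train),
#                            use_multiprocessing=False, workers=3)
```

Because each batch is fetched by index rather than by advancing shared iterator state, multiple threads can pull batches concurrently without corrupting the stream.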
