Verifying if GPU is actually used in Keras/TensorFlow, not just verified as present


Question

I've just built a deep learning rig (12-core AMD Threadripper; GeForce RTX 2080 Ti; 64 GB RAM). I originally wanted to install cuDNN and CUDA on Ubuntu 19.0, but the installation was too painful, and after reading around a bit I decided to switch to Windows 10...

After installing tensorflow-gpu several times, inside and outside conda environments, I ran into further issues that I assumed were down to cuDNN/CUDA/TensorFlow compatibility, so I uninstalled various versions of CUDA and TF. My output from nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130

I also attached the nvidia-smi output (which shows CUDA==11.0?!)

I also have:

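import keras
import tensorflow as tf
from keras import backend
# Assumed import (not shown in the original snippet); this private module has
# these attributes in TF 2.1, other releases may differ.
from tensorflow.python.platform import build_info as tf_build_info

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")
print("keras version: {0} | Backend used: {1}".format(keras.__version__, backend.backend()))
print("tensorflow version: {0} | Backend used: {1}".format(tf.__version__, backend.backend()))
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print("CUDA: {0} | CUDnn: {1}".format(tf_build_info.cuda_version_number, tf_build_info.cudnn_version_number))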

With output:

My device: [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12853915229880452239
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9104897474
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7328135816345461398
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:42:00.0, compute capability: 7.5"
]
Default GPU Device: /device:GPU:0
keras version: 2.3.1 | Backend used: tensorflow
tensorflow version: 2.1.0 | Backend used: tensorflow
Num GPUs Available:  1
CUDA: 10.1 | CUDnn: 7

So (I hope) my installation has at least partly worked. I still don't know whether the GPU is actually being used for my training, or whether it is merely recognised as present while the CPU does the work. How can I tell the difference?

I also use PyCharm. There was a recommendation to install Visual Studio, with an additional step here:

5. Include cudnn.lib in your Visual Studio project.
Open the Visual Studio project and right-click on the project name.
Click Linker > Input > Additional Dependencies.
Add cudnn.lib and click OK.

I didn't do this step. I also read that I need to set the following in my environment variables, but my directory is empty:

SET PATH=C:\tools\cuda\bin;%PATH%

Can someone verify this?

Also, one of my Keras models requires a hyperparameter search:

grid = GridSearchCV(estimator=model,
                        param_grid=param_grids,
                        n_jobs=-1, # -1 for all cores
                        cv=KFold(),
                        verbose=10)

grid_result = grid.fit(X_standardized, Y)

This works fine on my MBP (assuming, of course, that n_jobs=-1 uses all CPU cores). On my DL rig, I get errors and warnings:

ERROR: The process with PID 5156 (child process of PID 1184) could not be terminated.
Reason: Access is denied.
ERROR: The process with PID 1184 (child process of PID 6920) could not be terminated.
Reason: There is no running instance of the task.
2020-03-28 20:29:48.598918: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.599348: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.599655: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.603023: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.603649: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.604236: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.604773: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.605524: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.608151: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-03-28 20:29:48.608369: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
2020-03-28 20:29:48.608559: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: Blas GEMM launch failed : a.shape=(10, 8), b.shape=(8, 4), m=10, n=4, k=8
     [[{{node dense_1/MatMul}}]]
C:\Users\me\PycharmProjects\untitled\venv\lib\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
tensorflow.python.framework.errors_impl.InternalError:  Blas GEMM launch failed : a.shape=(10, 8), b.shape=(8, 4), m=10, n=4, k=8
     [[node dense_1/MatMul (defined at C:\Users\me\PycharmProjects\untitled\venv\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_982]

Can I assume that GridSearchCV uses only the CPU, and not the GPU? Still, when running and timing another method in my code, the MBP takes approx. 40 s (2.8 GHz Intel Core i7) while the desktop takes approx. 43 s (12-core Threadripper). Even comparing the CPUs alone, I'd expect the desktop to be far quicker than the MBP. Is my assumption wrong, then?

Recommended answer

The details below are based on the TensorFlow documentation:

If a TensorFlow operation has both CPU and GPU implementations, 
by default, the GPU devices will be given priority when the operation is assigned to a device.
For example, tf.matmul has both CPU and GPU kernels. 
On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device.

Logging device placement

import tensorflow as tf

# Log the device each op is executed on
tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Example Result
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
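
As a complement to reading the log, the placement can also be checked programmatically. The following is a minimal sketch, assuming TensorFlow 2.1 or later running in eager mode (on TF 2.0 the device listing lives under tf.config.experimental instead). The variable names placement and ran_on_gpu are illustrative and not part of any API.

```python
import tensorflow as tf

# Physical GPUs that TensorFlow can see (empty list on a CPU-only install).
gpus = tf.config.list_physical_devices("GPU")

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

# In eager mode each tensor exposes the device it lives on as a string,
# e.g. "/job:localhost/replica:0/task:0/device:GPU:0".
placement = c.device
ran_on_gpu = "GPU" in placement.upper()

print("Visible GPUs:", len(gpus))
print("MatMul result placed on:", placement)
print("Ran on GPU:", ran_on_gpu)
```

If ran_on_gpu is False while a GPU is listed, the op was placed on the CPU even though the device is visible, which is exactly the case the question is trying to rule out.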

Manual device placement

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Place tensors on the GPU explicitly
with tf.device('/GPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

c = tf.matmul(a, b)
print(c)

Example Result: 
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
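
Manual placement also makes it possible to time the same operation on each device, which gives a rough indication of whether the GPU is doing real work. This is a sketch under assumptions: the helper time_matmul, the matrix size, and the repeat count are all hypothetical choices, and a single matmul benchmark says nothing definitive about a full Keras training run.

```python
import time

import tensorflow as tf


def time_matmul(device_name, size=1000, repeats=5):
    """Time `repeats` square matmuls pinned to `device_name` (hypothetical helper)."""
    with tf.device(device_name):
        a = tf.random.uniform((size, size))
        b = tf.random.uniform((size, size))
        # Warm-up run so one-off initialisation cost is not measured.
        tf.matmul(a, b).numpy()
        start = time.perf_counter()
        for _ in range(repeats):
            c = tf.matmul(a, b)
        # Copying the result to host memory forces pending work to finish.
        c.numpy()
        return time.perf_counter() - start


timings = {"/CPU:0": time_matmul("/CPU:0")}
if tf.config.list_physical_devices("GPU"):
    timings["/GPU:0"] = time_matmul("/GPU:0")

for device_name, seconds in timings.items():
    print("{}: {:.4f} s".format(device_name, seconds))
```

For small matrices the GPU may not be faster, because transfer and launch overhead dominate; increasing size usually makes any difference easier to see.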
