Tensorflow Multi-GPU - NCCL

Question

I have been wanting to increase my batch size to improve the generalization of my model (it's very sensitive to batch size). The solution is to go multi-GPU in order to use more memory. I am using tensorflow.keras (with TensorFlow 2.1 on Windows 10) in my script, and I followed the instructions for configuring a mirrored strategy for my model. My training script runs perfectly fine without the mirrored strategy code, but with the mirrored strategy I get an error regarding NCCL. This looks to be exactly the same issue as:

https://github.com/tensorflow/tensorflow/issues/21470

Unfortunately, the solution discussed in that link:

# TF 1.x only: tf.contrib was removed in TF 2.x
cross_tower_ops = tf.contrib.distribute.AllReduceCrossDeviceOps(
    'hierarchical_copy', num_packs=num_gpus)
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

does not work with TF 2.1, since the 'contrib' portion of tf appears to have been removed. Does anyone know what the replacement fix is for NCCL on Windows, or what replaces the removed 'contrib' portion of tf?

Recommended answer

One solution from issue 21470 is to build NCCL for Windows x64. MyCaffe provides instructions for that here: https://github.com/MyCaffe/NCCL/blob/master/INSTALL.md

You'll need Visual Studio 2015 or 2017 and the CUDA development package, and once compiled you'll need to put the produced .dlls in the correct location.
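
For reference, the tf.contrib snippet from the question has a rough counterpart in the TF 2.x tf.distribute API that does not rely on NCCL. This is not part of the answer above, and I have not verified it against the asker's exact setup; it is a minimal sketch that assumes TensorFlow 2.1 or later. The tiny model is a hypothetical placeholder, and the num_packs value simply mirrors the question's num_packs=num_gpus.

```python
import tensorflow as tf

# Count the visible GPUs; on a CPU-only machine this is 0.
num_gpus = len(tf.config.list_physical_devices("GPU"))

# Hierarchical-copy all-reduce is a cross-device op that does not use NCCL.
# num_packs mirrors the question's num_packs=num_gpus (kept at 1 or more here).
cross_device_ops = tf.distribute.HierarchicalCopyAllReduce(
    num_packs=max(1, num_gpus))
strategy = tf.distribute.MirroredStrategy(cross_device_ops=cross_device_ops)
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Hypothetical placeholder model; build and compile the real model
# inside the strategy scope in the same way.
with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")
```

tf.distribute.ReductionToOneDevice() can be passed as cross_device_ops in the same way; it also avoids NCCL, though it may be slower because every reduction goes through a single device.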
