在多个 GPU 上训练 tensorflow 导致计算机崩溃 [英] Training tensorflow on multiple GPU crashes the computer

查看：54 发布时间：2021/9/5 19:37:22 tensorflow

本文介绍了在多个 GPU 上训练 tensorflow 导致计算机崩溃的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

我们使用以下硬件配置使用 tensorflow 运行多个 GPU 训练:

We use the following hardware configuration to run multiple GPU training using tensorflow:

ubuntu 16.04
cuda 8
cudnn 5.1
8 titan X pascal
220GB of memory

训练代码基于 tensorflow/models github 存储库中发布的 slim.

The training code is based on slim as published in tensorflow/models github repository.

如果我们不使用所有 GPU(最多 4 个，经过测试)，我们就可以运行训练代码.但是，一旦我们使用了所有 8 个 GPU，计算机就会崩溃.

We are able to run the training code if we don't use all GPUs (up to 4, tested). But, once we use all 8 GPUs, the computer crashes.

这可能是什么原因?