Tensorflow Java Multi-GPU inference


Question


I have a server with multiple GPUs and want to make full use of them during model inference inside a Java app. By default, TensorFlow seizes all available GPUs but uses only the first one.


I can think of three options to overcome this issue:

  1. Limit device visibility at the process level, i.e. via the CUDA_VISIBLE_DEVICES environment variable.


That would require me to run several instances of the Java app and distribute traffic among them. Not the most tempting idea.
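For completeness, option 1 could look like the sketch below: spawn one child process per GPU and pin each to its device through the environment. The jar name and command line are placeholders, not from the original post.

```java
import java.util.List;

public class PerDeviceLauncher {

    // Builds (but does not start) a process pinned to one GPU via
    // CUDA_VISIBLE_DEVICES. "inference-app.jar" is a hypothetical artifact.
    public static ProcessBuilder forDevice(int deviceIdx) {
        ProcessBuilder pb = new ProcessBuilder(
                List.of("java", "-jar", "inference-app.jar"));
        pb.environment().put("CUDA_VISIBLE_DEVICES", String.valueOf(deviceIdx));
        return pb;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            ProcessBuilder pb = PerDeviceLauncher.forDevice(i);
            System.out.println(pb.environment().get("CUDA_VISIBLE_DEVICES"));
        }
    }
}
```

Each child process then sees exactly one GPU (always as `/gpu:0` from its own point of view), which is why traffic must be distributed among the processes externally.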


  2. Launch several sessions inside a single application and try to assign one device to each of them via ConfigProto:

public class DistributedPredictor {

    private Predictor[] nested;
    private int[] counters;

    // ...

    public DistributedPredictor(String modelPath, int numDevices, int numThreadsPerDevice) {
        nested = new Predictor[numDevices];
        counters = new int[numDevices];

        for (int i = 0; i < nested.length; i++) {
            nested[i] = new Predictor(modelPath, i, numDevices, numThreadsPerDevice);
        }
    }

    public Prediction predict(Data data) {
        int i = acquirePredictorIndex();
        Prediction result = nested[i].predict(data);
        releasePredictorIndex(i);
        return result;
    }

    private synchronized int acquirePredictorIndex() {
        int i = argmin(counters);
        counters[i] += 1;
        return i;
    }

    private synchronized void releasePredictorIndex(int i) {
        counters[i] -= 1;
    }
}
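The `argmin` helper used in `acquirePredictorIndex` is not shown in the post; a minimal version (my sketch, not part of the original) simply returns the index of the least-loaded predictor:

```java
public class Argmin {

    // Index of the smallest element; ties resolve to the lowest index,
    // so an idle predictor with a low index is preferred.
    public static int argmin(int[] xs) {
        int best = 0;
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] < xs[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // counters = {3, 1, 2}: predictor 1 has the fewest in-flight requests
        System.out.println(Argmin.argmin(new int[]{3, 1, 2}));
    }
}
```

Since both `acquirePredictorIndex` and `releasePredictorIndex` are `synchronized`, the counters stay consistent under concurrent `predict` calls.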


public class Predictor {

    private Session session;

    public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) throws IOException {

        GPUOptions gpuOptions = GPUOptions.newBuilder()
                .setVisibleDeviceList("" + deviceIdx)
                .setAllowGrowth(true)
                .build();

        ConfigProto config = ConfigProto.newBuilder()
                .setGpuOptions(gpuOptions)
                .setInterOpParallelismThreads(numDevices * numThreadsPerDevice)
                .build();

        byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
        Graph graph = new Graph();
        graph.importGraphDef(graphDef);

        this.session = new Session(graph, config.toByteArray());
    }

    public Prediction predict(Data data) {
        // ...
    }
}


This approach seems to work fine at a glance. However, sessions occasionally ignore the setVisibleDeviceList option and all go for the first device, causing an out-of-memory crash.


  3. Build the model in a multi-tower fashion in Python using tf.device() specifications. On the Java side, give different Predictors different towers inside a shared session.


Feels cumbersome and idiomatically wrong to me.


UPDATE: As @ash proposed, there's yet another option:


  4. Assign an appropriate device to each operation of the existing graph by modifying its definition (graphDef).


To get it done, one could adapt the code from Method 2:

public class Predictor {

    private Session session;

    public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) throws IOException {

        byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
        graphDef = setGraphDefDevice(graphDef, deviceIdx);

        Graph graph = new Graph();
        graph.importGraphDef(graphDef);

        ConfigProto config = ConfigProto.newBuilder()
                .setAllowSoftPlacement(true)
                .build();

        this.session = new Session(graph, config.toByteArray());
    }

    private static byte[] setGraphDefDevice(byte[] graphDef, int deviceIdx) throws InvalidProtocolBufferException {
        String deviceString = String.format("/gpu:%d", deviceIdx);

        GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
        for (int i = 0; i < builder.getNodeCount(); i++) {
            builder.getNodeBuilder(i).setDevice(deviceString);
        }
        return builder.build().toByteArray();
    }

    public Prediction predict(Data data) {
        // ...
    }
}


Just like the other approaches mentioned, this one doesn't free me from manually distributing data among devices. But at least it works stably and is comparatively easy to implement. Overall, this looks like an (almost) normal technique.


Is there an elegant way to do such a basic thing with the TensorFlow Java API? Any ideas would be appreciated.

Answer


In short: There is a workaround, where you end up with one session per GPU.

Details:


The general flow is that the TensorFlow runtime respects the devices specified for operations in the graph. If no device is specified for an operation, then it "places" it based on some heuristics. Those heuristics currently result in "place operation on GPU:0 if GPUs are available and there is a GPU kernel for the operation" (Placer::Run in case you're interested).


What you ask for is, I think, a reasonable feature request for TensorFlow - the ability to treat devices in the serialized graph as "virtual" ones to be mapped to a set of "physical" devices at run time, or alternatively to set a "default device". This feature does not currently exist. Adding such an option to ConfigProto is something you may want to file a feature request for.


I can suggest a workaround in the interim. First, some commentary on your proposed solutions.


  1. Your first idea will surely work, but as you pointed out, is cumbersome.


Setting visible_device_list in the ConfigProto doesn't quite work out, since it is actually a per-process setting and is ignored after the first session is created in the process. This is certainly not documented as well as it should be (and it is somewhat unfortunate that this appears in the per-session configuration). However, this explains why your suggestion doesn't work and why you still see a single GPU being used.

That could work.


Another option is to end up with different graphs (with operations explicitly placed on different GPUs), resulting in one session per GPU. Something like this can be used to edit the graph and explicitly assign a device to each operation:

public static byte[] modifyGraphDef(byte[] graphDef, String device) throws Exception {
  GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
  for (int i = 0; i < builder.getNodeCount(); ++i) {
    builder.getNodeBuilder(i).setDevice(device);
  }
  return builder.build().toByteArray();
} 


After which you could create a Graph and Session per GPU using something like:

final int NUM_GPUS = 8;
// setAllowSoftPlacement: Just in case our device modifications were too aggressive
// (e.g., setting a GPU device on an operation that only has CPU kernels)
// setLogDevicePlacement: So we can see what happens.
byte[] config =
    ConfigProto.newBuilder()
        .setLogDevicePlacement(true)
        .setAllowSoftPlacement(true)
        .build()
        .toByteArray();
Graph graphs[] = new Graph[NUM_GPUS];
Session sessions[] = new Session[NUM_GPUS];
for (int i = 0; i < NUM_GPUS; ++i) {
  graphs[i] = new Graph();
  graphs[i].importGraphDef(modifyGraphDef(graphDef, String.format("/gpu:%d", i)));
  sessions[i] = new Session(graphs[i], config);    
}


Then use sessions[i] to execute the graph on GPU #i.
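How requests are routed to `sessions[i]` is left to the caller. One simple thread-safe scheme (my sketch, not part of the answer; the asker's counter-based least-loaded dispatcher is an alternative) is round-robin over the session indices:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobin {

    private final AtomicInteger next = new AtomicInteger();
    private final int numSessions;

    public RoundRobin(int numSessions) {
        this.numSessions = numSessions;
    }

    // Thread-safe: each call returns the next session index in rotation.
    // floorMod keeps the result non-negative even if the counter overflows.
    public int nextIndex() {
        return Math.floorMod(next.getAndIncrement(), numSessions);
    }

    public static void main(String[] args) {
        RoundRobin rr = new RoundRobin(8);
        for (int i = 0; i < 10; i++) {
            System.out.print(rr.nextIndex() + " ");
        }
    }
}
```

A request handler would then call `sessions[rr.nextIndex()].runner()...` to spread load evenly across the GPUs, regardless of per-request cost.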

Hope that helps.

