Tensorflow new Op CUDA kernel memory management


Question

I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel. This Op requires a lot of dynamic memory allocation for variables which are not tensors and are deallocated after the Op is done; more specifically, it involves using a hash table.

Right now I am using cudaMalloc() and cudaFree(), but I have noticed that TensorFlow has its own type, Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU.
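For reference, this is roughly the interface in question: inside Compute() the per-op Eigen device comes from the OpKernelContext, and Eigen::GpuDevice exposes raw allocate()/deallocate() calls plus the CUDA stream. A minimal sketch, where the ComputeSketch wrapper and the num_buckets size are made up for illustration:

#include "tensorflow/core/framework/op_kernel.h"

using GPUDevice = Eigen::GpuDevice;

void ComputeSketch(tensorflow::OpKernelContext* context) {
  const GPUDevice& d = context->eigen_device<GPUDevice>();
  const size_t num_buckets = 1024;                      // hypothetical size
  void* table = d.allocate(num_buckets * sizeof(int));  // raw device memory
  // ... launch CUDA kernels on d.stream() that use `table` ...
  d.deallocate(table);
}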

My questions are:

  1. Is it best practice to use Eigen::GPUDevice to manage GPU memory?
  2. By using Eigen::GPUDevice instead of the CUDA API, am I "automatically" enabling multi-GPU support, since different GPUDevices can be passed to the Op?
  3. Should I extend this idea to the CPU kernel and see if there is a CPUDevice type which also manages the memory, instead of using plain C++ (i.e. auto var = new int[100]; ... delete[] var)?

Answer

There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory via:

#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

template<typename Device, typename Dtype>
class MyOp : public OpKernel {
public:
  explicit MyOp(OpKernelConstruction *context) :
      OpKernel(context)
  {
    // ...
  }

  void Compute(OpKernelContext *context) override
  {
    Tensor tmp_var;            // allocate_temp fills a Tensor held by value
    Tensor* output = nullptr;  // allocate_output hands back a Tensor*

    TensorShape some_shape, some_shape2;

    // temporarily use this space (released when tmp_var goes out of scope)
    OP_REQUIRES_OK(context, context->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // allocate memory for the output tensor
    OP_REQUIRES_OK(context, context->allocate_output(0, some_shape2, &output));
    // ...
  }
};

  1. Whatever memory you need, it should be allocated by the TensorFlow context, not by custom cudaMalloc or new type[num] calls (a sketch of how such memory reaches a raw CUDA kernel follows this list).
  2. The context provides the information for the allocator.
  3. See below.
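To connect this to the question: once allocate_temp has run, the tensor's buffer already lives on the op's device, so a raw device pointer for a hand-written CUDA kernel can be taken from it. A minimal sketch; the kernel name BuildHashTable, the LaunchSketch helper, and the launch configuration are hypothetical:

// Hypothetical CUDA kernel standing in for whatever builds the hash table.
__global__ void BuildHashTable(float* scratch, int n) { /* ... */ }

void LaunchSketch(tensorflow::OpKernelContext* context,
                  tensorflow::Tensor& tmp_var) {
  // The temp tensor was allocated on the op's device, so .data() is
  // already a device pointer -- no cudaMalloc/cudaFree needed.
  float* scratch = tmp_var.flat<float>().data();
  const int n = static_cast<int>(tmp_var.NumElements());

  const Eigen::GpuDevice& d = context->eigen_device<Eigen::GpuDevice>();
  BuildHashTable<<<(n + 255) / 256, 256, 0, d.stream()>>>(scratch, n);
}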

Consider, for the sake of simplicity, just adding two matrices (full example). TensorFlow operations usually contain the following structure:

  • an Op description with REGISTER_OP, which is responsible for shape-checking and setting the output shape (example; see the REGISTER_OP sketch after this list);

  • an OpKernel responsible for allocating memory, getting pointers to the inputs, and other setup (see above or this);

  • a Functor for the implementation itself, e.g.:

Tensor* output = nullptr;
Tensor tmp_var;  // held by value; allocate_temp fills it in
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
// the functor does not need to care about memory allocation, as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, tmp_var, output);
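For the first bullet, the REGISTER_OP block for a two-matrix add might look like the sketch below; the op name "MyAdd" and the attribute list are assumptions, not the linked example verbatim:

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

// Shape-checking and output-shape setup live here, not in the kernel.
REGISTER_OP("MyAdd")
    .Input("a: T")
    .Input("b: T")
    .Output("sum: T")
    .Attr("T: {float, double}")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      ::tensorflow::shape_inference::ShapeHandle merged;
      // require both inputs to have the same shape; output matches them
      TF_RETURN_IF_ERROR(c->Merge(c->input(0), c->input(1), &merged));
      c->set_output(0, merged);
      return ::tensorflow::Status::OK();
    });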

You are then left with just implementing:

    // gpu version
    template <typename Dtype>
    struct MyFunctor<GPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...);
    };

    // cpu version
    template <typename Dtype>
    struct MyFunctor<CPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...);
    };
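To make the division of labor concrete, a CPU specialization for the matrix-add example could look like the sketch below; the primary template and the exact argument list (tmp_var omitted) are assumptions, not taken from the linked code:

typedef Eigen::ThreadPoolDevice CPUDevice;

namespace tensorflow {
namespace functor {

// Primary template, specialized per device.
template <typename Device, typename Dtype>
struct MyFunctor;

template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
  void operator()(OpKernelContext* ctx, const Tensor& inputA,
                  const Tensor& inputB, Tensor* output) {
    // All buffers were allocated by the OpKernel; the functor just uses them.
    auto a = inputA.flat<Dtype>();
    auto b = inputB.flat<Dtype>();
    auto out = output->flat<Dtype>();
    for (int64_t i = 0; i < out.size(); ++i) {
      out(i) = a(i) + b(i);
    }
  }
};

}  // namespace functor
}  // namespace tensorflow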

Edit

  • allocate_persistent: use this if you need your data between Op invocations, e.g. one-time index structures (sketched below). [example]
  • allocate_temp: just temporary memory, which will not be retained past the end of the Compute method. [example]
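A minimal sketch of the allocate_persistent pattern, assuming a TF 1.x-era kernel; the MyIndexedOp name, the DT_INT32 type, the shape, and the index_ field are made up for illustration:

class MyIndexedOp : public OpKernel {
 public:
  explicit MyIndexedOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
    // Build the one-time index structure at kernel-construction time.
    OP_REQUIRES_OK(ctx, ctx->allocate_persistent(
        DT_INT32, TensorShape({1024}), &index_, nullptr));
  }

  void Compute(OpKernelContext* ctx) override {
    Tensor* index = index_.AccessTensor(ctx);  // survives across invocations
    // ... use `index`; use allocate_temp for anything per-call ...
  }

 private:
  PersistentTensor index_;  // retained between Compute calls
};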

But I highly recommend reading the comments in the source code here, and then deciding depending on your use case.

