Tensorflow new Op CUDA kernel memory management
Question
I have implemented a rather complex new Op in Tensorflow with a GPU CUDA kernel. This Op requires a lot of dynamic memory allocation of variables which are not tensors and are deallocated after the op is done; more specifically, it involves using a hash table.
Right now I am using cudaMalloc() and cudaFree(), but I have noticed that Tensorflow has its own type called Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU.
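For context, Eigen's GPU device does expose raw allocation hooks. A minimal sketch of calling them directly, assuming the allocate()/deallocate() members from Eigen's TensorDeviceGpu header (whether they are the intended entry point for custom ops is exactly what is being asked here):

#define EIGEN_USE_GPU
#include "unsupported/Eigen/CXX11/Tensor"

// Sketch only: grabs scratch space from the device's allocator and frees it.
// device.stream() returns the CUDA stream the op is running on.
void UseScratch(const Eigen::GpuDevice& device) {
  void* scratch = device.allocate(1024 * sizeof(float));
  // ... launch kernels on device.stream() that use scratch ...
  device.deallocate(scratch);
}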
My questions:

- Is it best practice to use Eigen::GPUDevice to manage GPU memory?
- By using Eigen::GPUDevice instead of the CUDA API, am I "automatically" enabling multi-GPU support, since different GPUDevices can be passed to the Op?
- Should I extend this idea to the CPU kernel and see if there is a CPUDevice type which also manages the memory, instead of using plain C++ (i.e. auto var = new int[100]; delete[] var)?
Answer
There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory via:
template <typename Device, typename Dtype>
class MyOp : public OpKernel {
 public:
  explicit MyOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
    // ...
  }

  void Compute(OpKernelContext* ctx) override {
    Tensor tmp_var;            // allocate_temp fills a Tensor held by value
    Tensor* output = nullptr;  // allocate_output hands back a borrowed pointer
    TensorShape some_shape, some_shape2;

    // temporarily use this space during Compute
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // allocate memory for the output tensor
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
    // ...
  }
};
- Whatever memory is needed should be allocated by the TensorFlow context, not by custom cudaMalloc or new type[num] calls.
- The context provides the information for the allocator.
- See below.
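If the hash table genuinely needs raw, non-tensor device memory, one hedged alternative to cudaMalloc is to go through the device's TensorFlow allocator, so the memory is still tracked by TensorFlow. A sketch, assuming TF's internal Allocator/AllocatorAttributes interfaces (not a documented, stable API):

void Compute(OpKernelContext* ctx) override {
  // Ask the device that runs this op for its allocator.
  Allocator* allocator = ctx->device()->GetAllocator(AllocatorAttributes());
  const size_t num_bytes = 1 << 20;  // hypothetical hash-table footprint
  void* buckets =
      allocator->AllocateRaw(Allocator::kAllocatorAlignment, num_bytes);
  OP_REQUIRES(ctx, buckets != nullptr,
              errors::ResourceExhausted("hash-table allocation failed"));
  // ... build and query the device-side hash table ...
  allocator->DeallocateRaw(buckets);
}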
For the sake of simplicity, consider just adding two matrices (full example). TensorFlow operations usually contain the following structure:
Op description having REGISTER_OP, which is responsible for shape-checking and setting the output shape (example)
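A minimal registration for the matrix-add case might look like the following sketch (the op name MyAdd and the dtype list are illustrative, not from the original answer):

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

// Hypothetical registration: two inputs, one output, same dtype, and a shape
// function that forwards the first input's shape to the output.
REGISTER_OP("MyAdd")
    .Input("a: T")
    .Input("b: T")
    .Output("sum: T")
    .Attr("T: {float, double}")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return ::tensorflow::Status::OK();
    });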
OpKernel responsible for allocating memory, getting pointers to the inputs and setup stuff (see above or this)
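The templated kernel is then registered once per device; a sketch following the hypothetical MyAdd naming from above:

// Registers the same templated OpKernel for CPU and GPU execution;
// the Device template argument selects the matching functor specialization.
REGISTER_KERNEL_BUILDER(
    Name("MyAdd").Device(DEVICE_CPU).TypeConstraint<float>("T"),
    MyOp<CPUDevice, float>);
REGISTER_KERNEL_BUILDER(
    Name("MyAdd").Device(DEVICE_GPU).TypeConstraint<float>("T"),
    MyOp<GPUDevice, float>);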
Functor for the implementation itself, e.g.:
Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
// the functor does not need to care about memory allocation,
// as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, tmp_var, output);
You are then left with just implementing:
// gpu version
template <typename Dtype>
struct MyFunctor<GPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, ...);
};

// cpu version
template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, ...);
};
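As an illustration, the CPU specialization's body for the add example could look like this minimal sketch (the raw-pointer signature and argument names are assumptions; the real signature is whatever you choose above):

// Hypothetical CPU implementation of the add functor: a plain loop over
// the flattened inputs, writing into the pre-allocated output buffer.
template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, const Dtype* a,
                  const Dtype* b, Dtype* out, int n) {
    for (int i = 0; i < n; ++i) {
      out[i] = a[i] + b[i];
    }
  }
};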
Edit
- allocate_persistent: use this if you need your data between Op invocations, e.g. one-time index structures. [example]
- allocate_temp: just temporary memory, which will not be retained past the end of the Compute method's lifetime. [example]
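A sketch of the persistent pattern, assuming a one-time index structure that is built lazily on first use (PersistentTensor and this allocate_persistent signature come from the TF 1.x OpKernel API; member names are illustrative):

class MyStatefulOp : public OpKernel {
 public:
  explicit MyStatefulOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    if (!index_initialized_) {
      Tensor* index = nullptr;
      // Memory survives beyond this Compute() call, unlike allocate_temp.
      OP_REQUIRES_OK(ctx, ctx->allocate_persistent(
                              DT_INT32, TensorShape({1024}), &index_, &index));
      // ... fill the one-time index structure ...
      index_initialized_ = true;
    }
    Tensor* index = index_.AccessTensor(ctx);
    // ... use the index ...
  }

 private:
  PersistentTensor index_;
  bool index_initialized_ = false;
};

Note that a real kernel would guard the lazy initialization with a mutex, since Compute may be invoked concurrently.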
But I highly recommend reading the comments in the source code here, and then deciding depending on your use case.