Trouble compiling with custom tensorflow gpu op


Problem description

I am a newbie at C++, but I managed to work out an original new CPU op for tensorflow. Now I would like to work out an op for the GPU. I have a little experience working with OpenCL. I am following the guide here:

https://www.tensorflow.org/versions/r0.11/how_tos/adding_an_op/index.html#gpu-support

Below is my C++ code, followed by the CUDA file. The code doesn't actually do anything yet. It compiles correctly, but every time I try to run it I get a core dump. For the purpose of debugging I have removed all the contents of my class so that I can focus on the problem. The output essentially says this:

    *** Error in `/usr/bin/python': free(): invalid next size (fast): 0x00007fef04033ba0 ***

Here is the d_grid_gpu.cc file:

    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"

    REGISTER_OP("DGridGpu")
        .Input("grid: int32")
        .Attr("start_x: int = 0")
        .Attr("start_y: int = 0")
        .Attr("stop_x: int = 28")
        .Attr("stop_y: int = 28")
        .Attr("size_x: int = 28")
        .Attr("size_y: int = 28")
        .Attr("wall_height: float = 2.5")
        .Output("prev: int32");

    using namespace tensorflow;

    void run();

    class DGridGpuOp : public OpKernel {
      public:
      explicit DGridGpuOp(OpKernelConstruction* context) : OpKernel(context) {

      }

      void Compute(OpKernelContext* context) override {
         run();
      }

    };

    REGISTER_KERNEL_BUILDER(Name("DGridGpu").Device(DEVICE_GPU), DGridGpuOp);

Here is the d_grid_gpu.cu.cc file:

    #if GOOGLE_CUDA
    #define EIGEN_USE_GPU
    #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"


    //  content here

    #include <stdio.h>
    #define SIZE    1024

    __global__ void VectorAdd(int *a, int *b, int *c, int n)
    {
        int i = threadIdx.x;

        if (i < n)
            c[i] = a[i] + b[i];
    }

    void run() {
        int *a, *b, *c;
        int *d_a, *d_b, *d_c;

        a = (int *)malloc(SIZE*sizeof(int));
        b = (int *)malloc(SIZE*sizeof(int));
        c = (int *)malloc(SIZE*sizeof(int));

        cudaMalloc( &d_a, SIZE*sizeof(int));
        cudaMalloc( &d_b, SIZE*sizeof(int));
        cudaMalloc( &d_c, SIZE*sizeof(int));

        for( int i = 0; i < SIZE; ++i )
        {
            a[i] = i;
            b[i] = i;
            c[i] = 0;
        }

        cudaMemcpy( d_a, a, SIZE*sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( d_b, b, SIZE*sizeof(int), cudaMemcpyHostToDevice );
        cudaMemcpy( d_c, c, SIZE*sizeof(int), cudaMemcpyHostToDevice );

        // blocks, threads
        VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);

        cudaMemcpy( c, d_c, SIZE*sizeof(int), cudaMemcpyDeviceToHost );

        for( int i = 0; i < 10; ++i)
            printf("output : c[%d] = %d\n", i, c[i]);

        free(a);
        free(b);
        free(c);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
    }

    #endif

Here is the code I use to build the op:

    TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')

    nvcc -std=c++11 -c -o d_grid_gpu.cu.o d_grid_gpu.cu.cc \
    -I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC --expt-relaxed-constexpr

    g++ -std=c++11 -shared -o d_grid_gpu.so d_grid_gpu.cc \
    d_grid_gpu.cu.o -I $TF_INC -fPIC -lcudart -D_GLIBCXX_USE_CXX11_ABI=0 -L /usr/lib/x86_64-linux-gnu/

That's all I have. As I say, the CUDA code does nothing useful yet, but the whole op compiles. I have Python code that calls this library, which I have not included. I do believe my CUDA installation is working. I am using Ubuntu 16.10 and CUDA 8.

Edit -- some of the error output before the dump:

    *** Error in `/usr/bin/python': free(): invalid next size (fast): 0x00007f34f4033ba0 ***
    ======= Backtrace: =========
    /lib/x86_64-linux-gnu/libc.so.6(+0x790cb)[0x7f35664f20cb]
    /lib/x86_64-linux-gnu/libc.so.6(+0x8275a)[0x7f35664fb75a]
    /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f35664ff18c]
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(+0x22223a1)[0x7f354d7953a1]
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(+0x222b6a2)[0x7f354d79e6a2]
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(+0x221fd90)[0x7f354d792d90]
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x3c8)[0x7f354d9f4ce8]
    /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x22)[0x7f354d9f44b2]
    /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb8f0)[0x7f354b0408f0]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x770a)[0x7f356684770a]
    /lib/x86_64-linux-gnu/libc.so.6(clone+0x5f)[0x7f35665810af]
    ======= Memory map: ========
    200000000-200100000 rw-s 3cf997000 00:06 570                             /dev/nvidiactl
    ... more memory map here...

I hope this is helpful. I tried this out and for a moment I thought it was working, but I cannot reproduce the results.

Edit: I have changed the code.

d_grid_gpu.cc

    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"

    REGISTER_OP("DGridGpu")
        .Input("grid: int32")
        .Output("prev: int32");

    using namespace tensorflow;

    void run(const int * in, int * out);

    class DGridGpuOp : public OpKernel {
      public:
      explicit DGridGpuOp(OpKernelConstruction* context) : OpKernel(context) {


      }

      void Compute(OpKernelContext* context) override {


        Tensor* prev_h = NULL;

        const Tensor& grid_h = context->input(0);

        auto grid = grid_h.flat<int32>();    

        OP_REQUIRES_OK(context, context->allocate_output(
                                     0, 
                                     TensorShape({64}), &prev_h));

        auto prev = prev_h->flat<int32>();

        run(grid.data(), prev.data()); // do something to grid_host and move it to prev_host

        //exit
      }

    };

    REGISTER_KERNEL_BUILDER(Name("DGridGpu").Device(DEVICE_GPU), DGridGpuOp);
    //REGISTER_KERNEL_BUILDER(Name("DGridGpu").Device(DEVICE_CPU), DGridGpuOp);

d_grid_gpu.cu.cc

    #if GOOGLE_CUDA
    #define EIGEN_USE_GPU
    #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

    #include <stdio.h>
    #define SIZE    20

        __global__ void VectorAdd( const int *in, int *out,  int n)
        {
            int i = threadIdx.x;

            if (i < n)
                out[i] = in[i] + out[i];
        }


        void run(const int * in, int * out) {

            VectorAdd<<< 1, SIZE >>>(  in, out, SIZE);

        }

    #endif


Recommended answer

Briefly, the larger issue is that you are attempting to manage memory yourself, but Tensorflow already knows how to do this for you. You should use Tensorflow's mechanisms for managing memory; you do not need any of the malloc, free, cudaMalloc, cudaFree, or cudaMemcpy code.

I would start by modifying the GPU kernel from the tutorial:

https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/g3doc/how_tos/adding_an_op/cuda_op_kernel.cc
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/g3doc/how_tos/adding_an_op/cuda_op_kernel.cu.cc

The kernel receives as inputs buffers that are already allocated in GPU memory. You just need to pass their addresses to the GPU kernel.

To allocate a buffer for your output, you should use OpKernelContext::allocate_output() to allocate a Tensor and pass its address to your GPU kernel. There's also an allocate_temp() for allocating temporary buffers. The example above allocates its outputs this way. By default, on GPU, this allocates a buffer in GPU memory, so there is no need to allocate memory yourself or to copy things between host and device.
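As a point of reference, here is a minimal sketch of what the Compute() method of your edited d_grid_gpu.cc could look like when the output goes through allocate_output() and the raw device pointers are handed straight to the CUDA launcher. The extra length argument to run() and taking the output shape from the input are my own adjustments rather than anything from your code; the REGISTER_OP / REGISTER_KERNEL_BUILDER lines stay exactly as you already have them.

    // Sketch only: a DEVICE_GPU kernel whose buffers are all managed by Tensorflow.
    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"

    using namespace tensorflow;

    void run(const int* in, int* out, int n);  // launcher defined in d_grid_gpu.cu.cc

    class DGridGpuOp : public OpKernel {
     public:
      explicit DGridGpuOp(OpKernelConstruction* context) : OpKernel(context) {}

      void Compute(OpKernelContext* context) override {
        // On a DEVICE_GPU kernel the input tensor already lives in GPU memory.
        const Tensor& grid_h = context->input(0);
        auto grid = grid_h.flat<int32>();

        // Let Tensorflow allocate the output; on the GPU this is device memory too.
        Tensor* prev_h = nullptr;
        OP_REQUIRES_OK(context,
                       context->allocate_output(0, grid_h.shape(), &prev_h));
        auto prev = prev_h->flat<int32>();

        // Pass the raw device pointers to the CUDA launcher; no cudaMemcpy needed.
        run(grid.data(), prev.data(), static_cast<int>(grid.size()));
      }
    };

Sizing the output from the input's shape also avoids hard-coding the 64-element TensorShape from your edited version.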

You are currently populating a buffer fed as input to your kernel on the host and then copying it to the GPU manually. It might be easiest to either populate that buffer using the GPU, or to use a separate Tensorflow CPU operator to create the input; Tensorflow takes care of host -> device copies when necessary.
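If you take the first route, the input values can be generated directly on the device instead of being built on the host and copied over. A rough, purely illustrative sketch (the FillRange kernel and fill_on_device launcher are hypothetical names, not part of Tensorflow or of the code above):

    // Illustrative only: produce the same 0..n-1 values the question's host loop
    // produced, but write them straight into a buffer that already lives on the GPU.
    __global__ void FillRange(int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = i;
    }

    void fill_on_device(int* d_out, int n) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        FillRange<<< blocks, threads >>>(d_out, n);
    }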

I hope this helps!
