C++ array to Halide Image (and back)


Question

    I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.

    I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.

    1. What is the canonical way of wrapping an existing array in a Halide::Image?
    2. How should the function copy be scheduled to perform the copy efficiently?

    Minimal working example

    #include <Halide.h>
    
    using namespace Halide;
    
    void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
    
        Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
        Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
    
        Var x,y;
        Func copy;
        copy(x,y) = in(x,y);
        copy.realize(out);
    }
    
    int main(void) {
        uint8_t in[10000], out[10000];
        _copy(in, out, 100, 100);
    }
    

    Compilation Flags

    clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp
    

    Solution

    Let me start with your second question: _copy takes a long time because it has to compile the Halide code to x86 machine code before it can run it. IIRC, Func caches the machine code, but since copy is local to _copy, that cache cannot be reused across calls. Anyway, scheduling copy is pretty simple because it is a pointwise operation. First, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example:

    copy.vectorize(x, 32).parallel(y);

    will vectorize along x with a vector size of 32 and parallelize along y. (I am writing this from memory, so the exact names may be off.) Of course, doing all this might also increase compile times...
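To make the effect of that schedule concrete, here is a minimal sketch in plain C++ (no Halide involved) of the loop nest it roughly describes: an outer loop over y whose iterations could be distributed across threads, and an x loop split into fixed chunks of 32 that a compiler can turn into vector loads and stores. The names copy_scheduled and kVec are made up for this illustration. Shifting the last chunk inwards is just one common way to handle a width that is not a multiple of 32, and the memcpy fallback for narrow rows is my own choice; what Halide actually generates for the tail depends on its tail strategy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical names for this illustration; not part of Halide.
constexpr int kVec = 32;  // vector width requested in the schedule

// Copies an M-row by N-column row-major array using the loop structure
// that vectorize(x, 32).parallel(y) roughly describes.
void copy_scheduled(const uint8_t* in, uint8_t* out, int M, int N) {
    assert(in != nullptr && out != nullptr && M >= 0 && N >= 0);
    // Outer loop over rows: parallel(y) would hand these iterations to a
    // thread pool. They run serially here to keep the sketch short.
    for (int y = 0; y < M; ++y) {
        const uint8_t* src = in + static_cast<std::ptrdiff_t>(y) * N;
        uint8_t* dst = out + static_cast<std::ptrdiff_t>(y) * N;
        if (N < kVec) {
            // Row narrower than one vector: plain copy (assumed fallback).
            std::memcpy(dst, src, static_cast<std::size_t>(N));
            continue;
        }
        const int chunks = (N + kVec - 1) / kVec;
        for (int xo = 0; xo < chunks; ++xo) {
            // Shift the final chunk inwards so it stays in bounds; it then
            // overlaps the previous chunk and rewrites a few values.
            const int x0 = std::min(xo * kVec, N - kVec);
            // Fixed-length inner loop that maps onto one vector operation.
            for (int xi = 0; xi < kVec; ++xi) {
                dst[x0 + xi] = src[x0 + xi];
            }
        }
    }
}
```

For the 100x100 case from the question, each row is covered by four chunks starting at x = 0, 32, 64 and 68.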

    There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and profiling the code. I also use the AOT compilation provided by Halide::Generator, which makes sure that I only measure the runtime of the code and not the compile time.

    Your other question was how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called buffer_t for everything image related. There is also a C++ wrapper called Halide::Buffer that makes using buffer_t a little easier. I think it can also be used in Func::realize instead of Halide::Image. The point is: if you understand buffer_t, you can wrap almost anything into something digestible by Halide.
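As a rough illustration of why understanding such a descriptor is enough to wrap existing memory, here is a hypothetical struct in plain C++. It is only loosely modeled on the kind of information a buffer descriptor like buffer_t carries (a raw pointer plus a min, extent and stride per dimension). SimpleBuffer, wrap_rows and address_of are names invented for this sketch, and the real Halide struct may differ in fields and layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor for illustration only; NOT the Halide struct.
struct SimpleBuffer {
    uint8_t* host;      // start of the wrapped memory (not owned)
    int32_t min[2];     // coordinate of the first element per dimension
    int32_t extent[2];  // number of elements per dimension
    int32_t stride[2];  // step between neighbours, in elements
    int32_t elem_size;  // bytes per element
};

// Wraps a dense row-major array with M rows and N columns without copying.
// Dimension 0 is x (fastest varying), dimension 1 is y (the row index).
SimpleBuffer wrap_rows(uint8_t* data, int32_t M, int32_t N) {
    assert(data != nullptr && M > 0 && N > 0);
    return SimpleBuffer{data, {0, 0}, {N, M}, {1, N}, 1};
}

// Returns the address of element (x, y) computed from the descriptor.
uint8_t* address_of(const SimpleBuffer& b, int32_t x, int32_t y) {
    assert(x >= b.min[0] && x < b.min[0] + b.extent[0]);
    assert(y >= b.min[1] && y < b.min[1] + b.extent[1]);
    const std::ptrdiff_t offset =
        static_cast<std::ptrdiff_t>(x - b.min[0]) * b.stride[0] +
        static_cast<std::ptrdiff_t>(y - b.min[1]) * b.stride[1];
    return b.host + offset * b.elem_size;
}
```

The extents are given as N then M, matching the argument order in the question's Buffer(UInt(8), N, M, ...) call, where the first extent is the row width.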
