Multithreading for image processing at GPU using CUDA


Question

Problem statement: I have to continuously process 8-megapixel images captured from a camera. Several image-processing algorithms have to run on each image, such as color interpolation and color transformation. These operations take a long time on the CPU, so I decided to do them on the GPU using CUDA kernels. I have already written a working CUDA kernel for the color transformation, but I still need some more boost in performance.

There are basically two sources of computation time:


  1. Copying the source image from the CPU to the GPU and vice versa
  2. Processing the source image on the GPU

While an image is being copied from the CPU to the GPU, nothing else happens. Similarly, while the GPU is processing an image, nothing else happens.

MY IDEA: I want to do multi-threading so that I can save some time. I want to capture the next image while the processing of the previous image is going on at the GPU, so that when the GPU finishes processing the previous image, the next image is already there, ready to be transferred from CPU to GPU.

What I need: I am completely new to the world of multi-threading, and I am working through some tutorials and other material to learn more about it. So I am looking for suggestions about the proper steps and proper logic.

Answer

Since your question is very broad, I can only think of the following advice:

1) Use CUDA streams

When using more than one CUDA stream, the CPU→GPU memory transfer, the GPU processing, and the GPU→CPU memory transfer can overlap. This way the processing of the next image can already begin while the result of the previous one is being transferred back. Note that for these copies to actually run asynchronously, the host buffers must be pinned memory (allocated with cudaMallocHost or cudaHostAlloc).

You can also decompose each frame: use n streams per frame and launch the image-processing kernels n times, each with an offset.

2) Apply the producer-consumer scheme

The producer thread captures the frames from the camera and stores them in a thread-safe container. The consumer thread(s) fetch a frame from this source container, upload it to the GPU using their own CUDA stream(s), launch the kernel, and copy the result back to the host. Each consumer thread synchronizes with its stream(s) before trying to get a new image from the source container.

A simple implementation could look like this:

#include <vector>
#include <thread>
#include <memory>

#include <cuda_runtime.h>

struct ThreadSafeContainer{ /*...*/ };

struct Producer
{
    Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {

    }

    void run()
    {
        while(true)
        {
            // grab image from camera
            // store image in container
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
};

struct Consumer
{
    Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {
        cudaStreamCreate(&stream);
    }
    ~Consumer()
    {
        cudaStreamDestroy(stream);
    }

    void run()
    {
        while(true)
        {
            // read next image from container

            // upload to GPU
            cudaMemcpyAsync(...,...,...,stream);
            // run kernel
            kernel<<<..., ..., ..., stream>>>(...);
            // copy results back
            cudaMemcpyAsync(...,...,...,stream);

            // wait for results 
            cudaStreamSynchronize(stream);

            // do something with the results
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
    cudaStream_t stream; // or multiple streams per consumer
};


int main()
{
    // create an instance of ThreadSafeContainer which will be shared between the Producer and Consumer instances
    auto container = std::make_shared<ThreadSafeContainer>();

    // create one instance of Producer, pass the shared container as an argument to the constructor
    auto p = std::make_shared<Producer>(container);
    // create a separate thread which executes Producer::run  
    std::thread producer_thread(&Producer::run, p);

    const int consumer_count = 2;
    std::vector<std::thread> consumer_threads;
    std::vector<std::shared_ptr<Consumer>> consumers;

    // create as many consumers as specified
    for (int i=0; i<consumer_count;++i)
    {
        // create one instance of Consumer, pass the shared container as an argument to the constructor
        auto c = std::make_shared<Consumer>(container);
        // keep the Consumer alive for as long as its thread runs
        consumers.push_back(c);
        // create a separate thread which executes Consumer::run
        consumer_threads.push_back(std::thread(&Consumer::run, c));
    }

    // wait for the threads to finish, otherwise the program will just exit here and the threads will be killed
    // in this example, the program will never exit since the infinite loop in the run() methods never end
    producer_thread.join();
    for (auto& t : consumer_threads)
    {
        t.join();
    }

    return 0;
}

