Copy data from GPU to CPU

Problem Description

I am trying to calculate a matrix using C++ AMP. I use an array with width and height of 3000 x 3000 and I repeat the calculating procedure 20000 times:

    //_height=_width=3000
    extent<2> ext(_height,_width);
    array<int, 2> GPU_main(ext,gpuDevice.default_view);
    array<int, 2> GPU_res(ext,gpuDevice.default_view);
    copy(_main, GPU_main);
    array_view<int,2> main(GPU_main);
    array_view<int,2> res(GPU_res);
    res.discard_data();
    number=20000;
    for(int i=0;i<number;i++)
    {
        parallel_for_each(e,[=](index<2> idx)restrict(amp)
        {
           res(idx)=main(idx)+idx[0];//not depend from calculation type
        }
    array_view<TYPE, 2>  temp=res;
    res=main;
    main=temp;
    }
    copy(main, _main);

Before the calculation I copy my matrix from host memory to GPU memory and create an array_view (code lines 0 to 7).

After that I start a loop that repeats the calculation 20000 times; in every iteration I launch a parallel_for_each that does the computation with C++ AMP.

The GPU calculates very fast, but when I copy the result back to the host array _main I find that this operation takes a lot of time. I also found that if I decrease number from 20000 to 2000, the copy time decreases as well.

Why does this happen? Is it some synchronization issue?

Answer

Your code (as is) doesn't compile; below is a fixed version which I think has the same intent. If you want to separate the copy time from the compute time, the simplest thing to do is to use array<> and explicit copies.

        int _height, _width;
        _height = _width = 3000;
        std::vector<int> _main(_height * _width); // host data.
        concurrency::extent<2> ext(_height, _width);
        // Start timing data copy
        concurrency::array<int, 2> GPU_main(ext /* default accelerator */);
        concurrency::array<int, 2> GPU_res(ext);
        concurrency::array<int, 2> GPU_temp(ext);
        concurrency::copy(begin(_main), end(_main), GPU_main);
        // Finish timing data copy
        int number = 20000;
        // Start timing compute
        for(int i=0; i < number; ++i)
        {
            concurrency::parallel_for_each(ext,
                [=, &GPU_res, &GPU_main](concurrency::index<2> idx) restrict(amp)
            {
               GPU_res(idx) = GPU_main(idx) + idx[0];
            });
            concurrency::copy(GPU_res, GPU_temp);       // Swap arrays on GPU
            concurrency::copy(GPU_main, GPU_res);
            concurrency::copy(GPU_temp, GPU_main);
        }
        GPU_main.accelerator_view.wait(); // Wait for compute
        // Finish timing compute
        // Start timing data copy
        concurrency::copy(GPU_main, begin(_main));
        // Finish timing data copy

Note the wait() call to force the compute to finish. Remember that C++ AMP commands usually queue work on the GPU, and that work is only guaranteed to have executed once you wait for it, either explicitly with wait() or implicitly by calling (for example) synchronize() on an array_view<>. To get a good idea of the timing you should really time the compute and the data copies separately (as shown above). You can find some basic timing code in Timer.h here: http://ampbook.codeplex.com/SourceControl/changeset/view/100791#1983676. There are examples of its use in the same folder.

However, I'm not sure I would really write the code this way unless I wanted to break out the copy and compute times. It is far simpler to use array<> for data that lives purely on the GPU and array_view<> for data that is copied to and from the GPU.

This would look like the code below.

        int _height, _width;
        _height = _width = 3000;
        std::vector<int> _main(_height * _width); // host data.
        concurrency::extent<2> ext(_height, _width);
        // Wraps _main; the data is copied to the accelerator on first use,
        // so no explicit copy() is needed here.
        concurrency::array_view<int, 2> _main_av(ext, _main);
        concurrency::array<int, 2> GPU_res(ext);
        concurrency::array<int, 2> GPU_temp(ext);
        int number = 20000;
        // Start timing compute and possibly copy
        for(int i=0; i < number; ++i)
        {
            concurrency::parallel_for_each(ext,
                [=, &GPU_res](concurrency::index<2> idx) restrict(amp)
            {
               GPU_res(idx) = _main_av(idx) + idx[0];
            });
            concurrency::copy(GPU_res, GPU_temp);  // Swap arrays on GPU
            concurrency::copy(_main_av, GPU_res);
            concurrency::copy(GPU_temp, _main_av);
        }
        _main_av.synchronize();  // Will wait for all work to finish
        // Finish timing compute & copy

Now the data that is only required on the GPU is declared on the GPU, and the data that needs to be synchronized with the host is declared as such. Clearer and less code.

You can find out more about this by reading my book on C++ AMP :)
