Brute force "bokeh" convolution and TDR avoidance


Problem description

I am performing a convolution that can take anywhere from about a second up to 16-20 seconds, depending on the size of the source image. The average time is about 2-4 seconds on a typical screen-resolution image.

I am experiencing TDRs, i.e. "the display device has stopped responding and has been reset", and I am exploring how to avoid them. Simply recovering is not an option because I do not "own" the memory that the recovery process seems to de-allocate.

My algorithm multiplies nearby pixels by values in a kernel that is defined as a custom shape, such as a hexagon or octagon (like the shape of the iris in a camera lens system). The algorithm is therefore not separable (into independent x and y passes); it has to compute each pixel by brute force.
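For illustration only, the following is a minimal C++ AMP sketch of the kind of non-separable, brute-force kernel described above; it is not the poster's actual code. It assumes a single-channel float image and a dense square table of kernel weights (zero outside the hexagon/octagon aperture shape) of radius KernelRadius; the names BruteForceBokeh, kernelWeights, etc. are hypothetical.

#include <amp.h>
using namespace concurrency;

// Sketch only: inputImage/outputImage are single-channel float images, and
// kernelWeights is a (2*KernelRadius+1) x (2*KernelRadius+1) dense table that
// is zero outside the custom aperture shape.
void BruteForceBokeh(const array_view<const float, 2>& inputImage,
                     const array_view<float, 2>& outputImage,
                     const array_view<const float, 2>& kernelWeights,
                     int KernelRadius)
{
    const int rows = inputImage.extent[0];
    const int cols = inputImage.extent[1];

    parallel_for_each(outputImage.extent,
        [=](index<2> idx) restrict(amp)
    {
        float sum = 0.0f;
        float weightTotal = 0.0f;

        // Non-separable: walk the full 2D neighborhood for every output pixel.
        for (int ky = -KernelRadius; ky <= KernelRadius; ky++)
        {
            for (int kx = -KernelRadius; kx <= KernelRadius; kx++)
            {
                // Clamp to the image edges so neighboring reads stay in bounds.
                int r = idx[0] + ky;
                int c = idx[1] + kx;
                r = r < 0 ? 0 : (r >= rows ? rows - 1 : r);
                c = c < 0 ? 0 : (c >= cols ? cols - 1 : c);

                float w = kernelWeights(ky + KernelRadius, kx + KernelRadius);
                sum += w * inputImage(r, c);
                weightTotal += w;
            }
        }

        outputImage[idx] = (weightTotal > 0.0f) ? (sum / weightTotal) : inputImage[idx];
    });
}

The cost per output pixel is on the order of (2*KernelRadius+1)^2 reads, which is why a large kernel on a large image can keep a single parallel_for_each call running long enough to trip the TDR timeout.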

My first thought was to separate the algorithm into strips, such as 90 scan lines at a time. I ran into several problems in this regard, and I never confirmed that simply making multiple calls to parallel_for_each was enough to solve them. I need help understanding the problem, and also solving various issues related to the "strip" method I tried.

The first problem I had was this: a call to parallel_for_each requires an extent, and I do not know how to define an extent that covers only a portion of an array (scan lines 90-180 out of 0-1000, for example). In other words, I want to pass the full array view but only create threads for a portion of it.

I tried making multiple array views, but since my convolution accesses neighboring pixels, this did not work: pixels outside a given array view did not exist in GPU memory.

I could pass additional parameters to parallel_for_each to control which part of the strip I want to work on, but that would mean copying the full amount of memory back and forth to and from the GPU many times, and I still don't know how to create threads for only part of the array view.

So I hope I have explained all of this correctly and clearly. Somebody please slap me if I'm heading in the wrong direction here, and please don't just tell me to google it, because this is a complex problem with no obvious solution. Thanks in advance for the help.

Recommended answer

DanDan,

I've done exactly this (splitting the problem into scan-line strips) many times in AMP without any trouble. I've done something like the following:

1) Declare your array_views with a 2D extent expressing the size of the image:

extent<2> imageExtent(ImageRows, ImageColumns);
array_view<const int, 2> inputImage(imageExtent, &inputImageIntBuffer[0]);
array_view<int, 2> processedImage(imageExtent, &processedImageIntBuffer[0]);

2) Declare a compute extent based on the size of the portion of the image you wish to process:

extent<2> computeExtent(NumScanlines, ImageColumns);

3) Loop over the number of image portions required to process the entire image (there may be a cleanup kernel call at the end if the image can't be divided evenly by the portion size you are processing):

int NumIterations = ImageRows / NumScanlines;

for (int x = 0; x < NumIterations; x++)
{
    parallel_for_each(computeExtent, [=](index<2> computeIndex) restrict(amp) { ... });
}
processedImage.synchronize();

Don't forget, at the top of your kernel, to offset the scan-line position of "computeIndex" to account for the loop iteration, i.e.:

int CurrentImageColumn = computeIndex[1];
int CurrentImageRow = computeIndex[0] + (x * NumScanlines);
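
For concreteness, here is a minimal end-to-end sketch of steps 1-3 combined; it is not the exact code from the answer. The kernel body is only a placeholder copy that the real brute-force convolution would replace, the wrapper name ProcessInStrips and the discard_data() call are additions for this sketch, and the final pass shows one way to handle the leftover rows (the "cleanup kernel call" mentioned in step 3) when ImageRows is not a multiple of NumScanlines.

#include <amp.h>
#include <vector>
using namespace concurrency;

// Sketch only: buffers hold one int per pixel in row-major order; the
// placeholder kernel body below stands in for the real convolution.
void ProcessInStrips(const std::vector<int>& inputImageIntBuffer,
                     std::vector<int>& processedImageIntBuffer,
                     int ImageRows, int ImageColumns, int NumScanlines)
{
    extent<2> imageExtent(ImageRows, ImageColumns);
    array_view<const int, 2> inputImage(imageExtent, &inputImageIntBuffer[0]);
    array_view<int, 2> processedImage(imageExtent, &processedImageIntBuffer[0]);
    processedImage.discard_data();          // write-only: skip the copy to the GPU

    extent<2> computeExtent(NumScanlines, ImageColumns);
    int NumIterations = ImageRows / NumScanlines;

    for (int x = 0; x < NumIterations; x++)
    {
        parallel_for_each(computeExtent,
            [=](index<2> computeIndex) restrict(amp)
        {
            // Offset the strip-local row into full-image coordinates.
            int CurrentImageRow    = computeIndex[0] + (x * NumScanlines);
            int CurrentImageColumn = computeIndex[1];

            // Placeholder body: the real brute-force convolution would read
            // inputImage around (CurrentImageRow, CurrentImageColumn) here.
            processedImage(CurrentImageRow, CurrentImageColumn) =
                inputImage(CurrentImageRow, CurrentImageColumn);
        });
    }

    // Cleanup pass for rows left over when ImageRows isn't a multiple of NumScanlines.
    int RemainderRows = ImageRows % NumScanlines;
    if (RemainderRows > 0)
    {
        int RowOffset = NumIterations * NumScanlines;
        parallel_for_each(extent<2>(RemainderRows, ImageColumns),
            [=](index<2> computeIndex) restrict(amp)
        {
            int CurrentImageRow    = computeIndex[0] + RowOffset;
            int CurrentImageColumn = computeIndex[1];
            processedImage(CurrentImageRow, CurrentImageColumn) =
                inputImage(CurrentImageRow, CurrentImageColumn);
        });
    }

    processedImage.synchronize();           // copy the result back to the CPU buffer
}

The idea is that each parallel_for_each submission now covers only NumScanlines rows, so each individual kernel launch runs for a fraction of the full-image time, which keeps the work per submission below the TDR limit, while the single synchronize() at the end avoids copying the image back and forth between strips.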

This strategy is very fast, since the array_views stay resident on the GPU for the duration of the loop, and the device driver compiles the kernel the first time it encounters it in the loop and then runs it at full speed thereafter.

-L

