Is it possible to vectorize this nested for with SSE?
Question
I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form:
    for (int j=-halfHeight; j<=halfHeight; ++j)
    {
        for(int i=-halfWidth; i<=halfWidth; ++i)
        {
            const float rx = ofsx + j * a12;
            const float ry = ofsy + j * a22;
            float wx = rx + i * a11;
            float wy = ry + i * a21;
            const int x = (int) floor(wx);
            const int y = (int) floor(wy);
            if (x >= 0 && y >= 0 && x < width && y < height)
            {
                // compute weights
                wx -= x; wy -= y;
                // bilinear interpolation
                *out++ =
                    (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x) + wx * im.at<float>(y,x+1)) +
                    (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
            } else {
                *out++ = 0;
            }
        }
    }
So, from my understanding, there are several differences with the linked article:
- Here we have a nested for: I've always seen a single-level for in vectorization examples, never a nested loop.
- The if condition is based on scalar values (x and y) and not on the array: how can I adapt the linked example to this?
- The out index isn't based on i or j (so it's not out[i] or out[j]): how can I fill out in this way?
In particular, I'm confused because for indexes are usually used as array indexes, whereas here they are used to compute variables while the output vector is advanced one element per iteration.
I'm using icpc with -O3 -xCORE-AVX2 -qopt-report=5 and a bunch of other optimization flags. According to Intel Advisor, this loop is not vectorized, and using #pragma omp simd generates warning #15552: loop was not vectorized with "simd".
Recommended answer
Bilinear interpolation is a rather tricky operation to vectorize, and I wouldn't try it for your first SSE trick. The problem is that the values you need to fetch are not nicely ordered in memory: they're sometimes repeated, sometimes skipped. The good news is that interpolating images is a common operation, and you can likely find a pre-written library to do it, such as OpenCV. remap() is always a good choice. Just build two arrays of wx and wy which represent the fractional source locations of each pixel, and let remap() do the interpolation.
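A minimal sketch of the map-building step, assuming the same variables as in the question's loop. The `SourceMaps` struct and `buildSourceMaps` function are hypothetical names introduced here for illustration, not part of OpenCV or the original code. The sketch is plain C++ so that it stands alone; it also shows the output being written through an explicit index instead of the running `*out++` pointer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container (not from the original post): the fractional source
// coordinates that the question's loop computes as wx and wy, one per output pixel.
struct SourceMaps {
    int cols;                  // 2 * halfWidth + 1
    int rows;                  // 2 * halfHeight + 1
    std::vector<float> map_x;  // row-major, rows * cols entries
    std::vector<float> map_y;  // row-major, rows * cols entries
};

// Hypothetical helper: fills both maps using the question's formulas.
SourceMaps buildSourceMaps(int halfWidth, int halfHeight,
                           float ofsx, float ofsy,
                           float a11, float a12, float a21, float a22)
{
    SourceMaps m;
    m.cols = 2 * halfWidth + 1;
    m.rows = 2 * halfHeight + 1;
    m.map_x.resize(static_cast<std::size_t>(m.rows) * m.cols);
    m.map_y.resize(m.map_x.size());

    for (int j = -halfHeight; j <= halfHeight; ++j) {
        // Row-dependent terms, hoisted out of the inner loop.
        const float rx = ofsx + j * a12;
        const float ry = ofsy + j * a22;
        for (int i = -halfWidth; i <= halfWidth; ++i) {
            // Explicit output index derived from i and j.
            const std::size_t idx =
                static_cast<std::size_t>(j + halfHeight) * m.cols + (i + halfWidth);
            m.map_x[idx] = rx + i * a11;
            m.map_y[idx] = ry + i * a21;
        }
    }
    return m;
}
```

With OpenCV available, the two vectors could be wrapped in `rows` x `cols` `CV_32FC1` `cv::Mat` headers and passed as the map arguments of `cv::remap(im, dst, mapX, mapY, cv::INTER_LINEAR, cv::BORDER_CONSTANT, cv::Scalar(0))`. Pixels near the image border may not match the original loop exactly, since the loop writes a hard zero for any out-of-range coordinate while `remap()` blends with the border value.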
However, in this case, it looks like an affine transform. That is, the fractional source location is related to the output pixel position by a 2x3 matrix multiplication. The matrix entries come from the ofsx/ofsy offsets and the a11/a12/a21/a22 variables. OpenCV has such a transform. Read about it here: http://docs.opencv.org/3.1.0/d4/d61/tutorial_warp_affine.html
All you have to do is map your input variables into matrix form and call the affine transform.