Is it possible to vectorize this nested for with SSE?

Question

I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form:

   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      for(int i=-halfWidth; i<=halfWidth; ++i)
      {
         const float rx = ofsx + j * a12;
         const float ry = ofsy + j * a22;
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            // bilinear interpolation
            *out++ =
               (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
               (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
         } else {
            *out++ = 0;
         }
      }
   }

So, from my understanding, there are several differences from the linked article:

  1. Here we have a nested for: I've always seen a single-level for in vectorization examples, never a nested loop.
  2. The if condition is based on scalar values (x and y) and not on the array: how can I adapt the linked example to this?
  3. The out index isn't based on i or j (so it's not out[i] or out[j]): how can I fill out in this way?

In particular, I'm confused because the for indexes are usually used as array indexes, while here they are used to compute variables, and the output vector is advanced one element per iteration.

I'm using icpc with -O3 -xCORE-AVX2 -qopt-report=5 and a bunch of other optimization flags. According to Intel Advisor, the loop is not vectorized, and adding #pragma omp simd generates warning #15552: loop was not vectorized with "simd".

Recommended answer

Bilinear interpolation is a rather tricky operation to vectorize, and I wouldn't try it as your first SSE exercise. The problem is that the values you need to fetch are not nicely ordered: they're sometimes repeated, sometimes skipped. The good news is that interpolating images is a common operation, and you can likely find a pre-written library to do it, such as OpenCV.
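To make the "repeated or skipped" point concrete, here is a minimal sketch that lists which source column each output pixel of one row would read. It reuses the names ofsx and a11 from the question; the function name sourceColumns and the count parameter are made up for illustration.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: returns floor(ofsx + i * a11) for i = 0 .. count-1,
// i.e. the integer source column that each output pixel in one row reads.
std::vector<int> sourceColumns(float ofsx, float a11, int count)
{
    std::vector<int> cols;
    cols.reserve(count);
    for (int i = 0; i < count; ++i)
        cols.push_back(static_cast<int>(std::floor(ofsx + i * a11)));
    return cols;
}
```

With ofsx = 0 and a11 = 0.5 the first four outputs read columns 0, 0, 1, 1 (repeats), and with a11 = 2 they read 0, 2, 4, 6 (skips). A SIMD load of four consecutive floats matches neither pattern, which is why the fetches are the hard part.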

OpenCV's remap() is a good general-purpose choice. Just build two arrays of wx and wy, which hold the fractional source location of each output pixel, and let remap() do the interpolation.
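As a rough sketch of building those two arrays, the loop below fills one row-major map for wx and one for wy using only the standard library. The SourceMaps struct and the buildMaps name are assumptions for illustration; the arithmetic is taken from the question's loop. The output index is derived from i and j, so nothing depends on incrementing out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the two coordinate maps (row-major storage).
struct SourceMaps {
    int rows = 0;
    int cols = 0;
    std::vector<float> x; // wx for every output pixel
    std::vector<float> y; // wy for every output pixel
};

SourceMaps buildMaps(int halfWidth, int halfHeight,
                     float ofsx, float ofsy,
                     float a11, float a12, float a21, float a22)
{
    SourceMaps m;
    m.cols = 2 * halfWidth + 1;
    m.rows = 2 * halfHeight + 1;
    m.x.resize(static_cast<std::size_t>(m.rows) * m.cols);
    m.y.resize(static_cast<std::size_t>(m.rows) * m.cols);

    for (int j = -halfHeight; j <= halfHeight; ++j) {
        // rx and ry depend only on j, so they are computed once per row.
        const float rx = ofsx + j * a12;
        const float ry = ofsy + j * a22;
        for (int i = -halfWidth; i <= halfWidth; ++i) {
            // Explicit output index instead of *out++.
            const std::size_t idx =
                static_cast<std::size_t>(j + halfHeight) * m.cols + (i + halfWidth);
            m.x[idx] = rx + i * a11;
            m.y[idx] = ry + i * a21;
        }
    }
    return m;
}

// Assumed usage with OpenCV (not compiled here):
//   cv::Mat mapX(m.rows, m.cols, CV_32FC1, m.x.data());
//   cv::Mat mapY(m.rows, m.cols, CV_32FC1, m.y.data());
//   cv::remap(im, patch, mapX, mapY, cv::INTER_LINEAR,
//             cv::BORDER_CONSTANT, cv::Scalar(0));
```

Note that remap() takes the full source coordinates, not the fractional parts left after subtracting x and y. Its border handling may also differ slightly from the explicit bounds test in the question, so pixels near the image edge are worth comparing.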

However, in this case, it looks like an affine transform. That is, the fractional source location is related to the output pixel by a 2x3 matrix multiplication, built from the ofsx/ofsy offsets and the a11/a12/a21/a22 variables. OpenCV has such a transform. Read about it here: http://docs.opencv.org/3.1.0/d4/d61/tutorial_warp_affine.html

All you have to do is map your input variables into matrix form and call the affine transform.
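One possible way to do that mapping is sketched below. It assumes the output patch is indexed from (0,0), so column c = i + halfWidth and row r = j + halfHeight. Substituting into the question's formulas gives wx = a11*c + a12*r + tx and wy = a21*c + a22*r + ty. The function name affineFromLoop is made up for illustration.

```cpp
#include <array>

// Hypothetical helper: returns the row-major 2x3 matrix
//   [ a11 a12 tx ]
//   [ a21 a22 ty ]
// that maps an output pixel (c, r) to its source location (wx, wy).
std::array<float, 6> affineFromLoop(int halfWidth, int halfHeight,
                                    float ofsx, float ofsy,
                                    float a11, float a12, float a21, float a22)
{
    // Fold the shift from centered (i, j) to zero-based (c, r) into the offsets.
    const float tx = ofsx - a11 * halfWidth - a12 * halfHeight;
    const float ty = ofsy - a21 * halfWidth - a22 * halfHeight;
    return {{a11, a12, tx, a21, a22, ty}};
}

// Assumed usage with OpenCV (not compiled here):
//   auto coeffs = affineFromLoop(...);
//   cv::Mat M(2, 3, CV_32F, coeffs.data());
//   cv::warpAffine(im, patch, M, cv::Size(cols, rows),
//                  cv::INTER_LINEAR | cv::WARP_INVERSE_MAP,
//                  cv::BORDER_CONSTANT, cv::Scalar(0));
```

Because this matrix goes from output pixel to source location, the usage comment passes WARP_INVERSE_MAP. Without that flag warpAffine() treats the matrix as source-to-destination and inverts it first.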
