优化2D旋转 [英] Optimising 2D rotation

查看:116
本文介绍了优化2D旋转的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出了在2D空间中旋转点的经典公式:

Given the classic formula for rotating a point in 2D space:

cv::Point pt[NPOINTS];
cv::Point rotated[NPOINTS];
float angle = WHATEVER;
float cosine = cos(angle);
float sine = sin(angle);

for (int i = 0; i < NPOINTS; i++)
{
    rotated[i].x = pt[i].x * cosine - pt[i].y * sine;
    rotated[i].y = pt[i].x * sine   + pt[i].y * cosine;
}

给定NPOINTS是32,并且阵列对齐,那么如何优化SSE或AVX的代码?在这里和其他地方搜索并没有发现任何有用的信息,我在这里迷路了:

Given NPOINTS is 32 and the arrays are aligned, how would one go about optimising the code for SSE or AVX? Searching around here and elsewhere didn't turn up anything useful, and I got lost about here:

__m128i onePoint = _mm_set_epi32(pt[i].x, pt[i].y, pt[i].x, pt[i].y);
__m128 onefPoint = _m128_cvtepi32_ps(onePoint);
__m128 sinCos = _mm_set_ps(cosine, -sine, sine, cosine);
__m128 rotated = _mm_mul_ps(onefPoint, sinCos);

但是如何从[y*cosine, -x*sine, x*sine, y*cosine]转到[y*cosine + -x*sine, x*sine + y*cosine]?这是最好的方法吗?它可以轻松缩放到__m512吗?

But how to go from [y*cosine, -x*sine, x*sine, y*cosine] to [y*cosine + -x*sine, x*sine + y*cosine]? Is this the best approach? Does it easily scale to __m512?

更新:我做了一些研究,现在大概有了:

UPDATE: I did a little more research and I now have approximately:

__m128i onePoint = _mm_set_epi32(pt[i].x, pt[i].y, pt[i].x, pt[i].y);
__m128 onefPoint = _m128_cvtepi32_ps(onePoint);
__m128i twoPoint = _mm_set_epi32(pt[i+1].x, pt[i+1].y, pt[i+1].x, pt[i+1].y);
__m128 twofPoint = _m128_cvtepi32_ps(twoPoint);
__m128 sinCos = _mm_set_ps(cosine, -sine, sine, cosine);
__m128 rotated1 = _mm_mul_ps(onefPoint, sinCos);
__m128 rotated2 = _mm_mul_ps(twofPoint, sinCos);
__m128 added = _mm_hadd_ps(rotated1, rotated2);
__m128i intResult = _mm_cvtps_epi32(added);
int results[4];
_mm_storeu_si128((__m128i*)results, intResult);

这使处理器时间从11%的50%提高到大约6%.扩展到__m256并一次执行四个点可提高速度.这看起来很糟糕,但是我在朝正确的方向前进吗?

This gives a 50% speed-up from 11% of the processor time to about 6%. Expanding to __m256 and doing four points at a time gives another speed-up. This looks quite awful code, but am I heading the right direction?

推荐答案

使用数组结构数组(AoSoA)并一次处理八个点.在下面的代码中,point8是包含八个点的数组的结构.函数rotate_point8旋转8个点,并且具有与函数rotate_point相同的代数结构,函数rotate_point旋转单个点.函数rotate_all8使用AoSoA point8*旋转32点.

Use an array of struct of arrays (AoSoA) and process eight points at a time. In the code below point8 is struct of arrays containing eight points. The function rotate_point8 rotates eight points and has the same algebraic structure as the the function rotate_point which rotates a single point. The function rotate_all8 rotates 32 points using the AoSoA point8*.

单点旋转代码进行4次乘法,1次加法和1次减法.

The single point rotation code does 4 multiplications, one addition, and one subtraction.

如果我们查看 rotate_point8 的程序集,我们会发现GCC展开了循环并执行了每个展开操作有4次SIMD乘法,1次SIMD加法和1次SIMD减法.那是您所能做的最好的事情:一个价格为八.

If we look at the assembly for rotate_point8 we see that GCC unrolled the loop and does 4 SIMD multiplications, one SIMD addition, and one SIMD subtraction per unroll. That's the best you can do: eight for the price of one.

#include <x86intrin.h>
#include <stdio.h>
#include <math.h>

struct point8 {
  __m256 x;
  __m256 y;
};

struct point {
  float x;
  float y;
};

static point rotate_point(point p, float a, float b) {
  point r;
  r.x = p.x*a - p.y*b;
  r.y = p.x*b + p.y*a;
  return r;
}

static point8 rotate_point8(point8 p, float a, float b) {
  __m256 va = _mm256_set1_ps(a), vb = _mm256_set1_ps(b);
  point8 r;
  r.x = _mm256_sub_ps(_mm256_mul_ps(p.x,va), _mm256_mul_ps(p.y,vb));
  r.y = _mm256_add_ps(_mm256_mul_ps(p.x,vb), _mm256_mul_ps(p.y,va));
  return r;
}

void rotate_all(point* points, point* r, float angle) {
  float a = cos(angle), b = sin(angle);
  for(int i=0; i<32; i++) r[i] = rotate_point(points[i], a, b);
}

void rotate_all8(point8* points, point8* r8, float angle) {
  float a = cos(angle), b = sin(angle);
  for(int i=0; i<4; i++) r8[i] = rotate_point8(points[i], a, b);
}

int main(void) {
  float x[32], y[32];
  point p[32], r[32];
  point8 p8[4], r8[4];
  float angle = 3.14159f/4;

  for(int i=0; i<32; i++) y[i] = 1.0*i/31, x[i] = sqrt(1-y[i]*y[i]);
  for(int i=0; i<32; i++) p[i].x = x[i], p[i].y = y[i];
  for(int i=0; i<4; i++) p8[i].x = _mm256_load_ps(&x[8*i]), p8[i].y = _mm256_load_ps(&y[8*i]); 

  for(int i=0; i<32; i++) printf("%f %f\n", p[i].x, p[i].y); puts("");

  rotate_all(p, r, angle);
  for(int i=0; i<32; i++) printf("%f %f\n", r[i].x, r[i].y); puts("");

  rotate_all8(p8, r8, angle);
  for(int i=0; i<4; i++) {
    _mm256_storeu_ps(x, r8[i].x), _mm256_storeu_ps(y, r8[i].y);
    for(int j=0; j<8; j++) printf("%f %f\n", x[j], y[j]);
  }
}

这篇关于优化2D旋转的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆