在C中有效地取整数向量的绝对值 [英] Efficiently taking the absolute value of an integer vector in C

查看：55 发布时间：2021/6/12 20:51:30 c optimization vectorization premature-optimization

本文介绍了在C中有效地取整数向量的绝对值的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

任务是将 C 整数数组的每个元素设置为其绝对值.我正在努力尽可能有效地做到这一点.以下是我所做的一系列优化.请告诉我这些是否真的是优化，是否可以做更多的优化！

The task is to set each element of a C integer array to its absolute value. I'm trying to do it as efficiently as possible. Below are a progression of optimizations that I've made. Please tell me if these are actually optimizations at all, and if any more can be made!

函数的第一个参数是一个整数数组，第二个参数是该数组的整数大小.

The first parameter of the function will be an integer array, and the second will be an integer size of that array.

以下是标准实现:

void absolute (int array[], int n){
  for(int i = 0; i < n; i++)
    if(array[i] < 0)
      array[i] = - array[i];
}

这足以满足任何介绍性编程课程教授的需求，但我想多玩一点，也许可以在此过程中学习一些有关优化的知识.

That's plenty to satisfy any introductory programming course professor, but I'd like to play with it a little more and maybe learn something about optimization along the way.

基于https://stackoverflow.com/a/2074403，一个无分支的绝对值:

Based on https://stackoverflow.com/a/2074403, a branchless absolute value:

void absolute (int array[], int n){
  for(int i = 0; i < n; i++){
    uint32_t temp = array[i] >> 31;     // make a mask of the sign bit
    array[i] ^= temp;                   // toggle the bits if value is negative
    array[i] += temp & 1;               // add one if value was negative
  }
}

基于与零的比较，效率更高，并且不需要额外的变量:

Based on comparisons to zero being more efficient, and not needing an extra variable sitting around:

void absolute (int array[], int n){
  for(n--; n >= 0;){
    uint32_t temp = array[n] >> 31;
    array[n] ^= temp;
    array[n] += temp & 1;
  }
}

(不过这个向量化了吗?)

(does this vectorize anymore though?)

就我所知.可以做更多的事情来优化这个功能吗?

That's as far as I got. Can more be done to optimize this function?

推荐答案

我个人比较喜欢这个问题.正是这些问题让您想知道是否有办法让我自己的代码变得更好.

Personally I rather like this question. It's questions like these that get you wondering if perhaps there is a way to make my own code better.

您的最终优化不正确，因为它初始化了 n--，但 n 永远不会再次递减.要纠正此问题，您需要 for(n--; n >= 0; n--).虽然结果我发现减少或增加 for 循环没有明显的优势.

Your final optimization is incorrect as it initializes n--, but n is never decremented again. To correct this you need for(n--; n >= 0; n--). Though the results I had found that decrementing or incrementing your for loop contained no noticeable advantage.

如果数组的值不是真正随机分布，我发现第一个实现中使用的简单 if(array[i] < 0) 实际上要快得多.

If the values of the array are not truly randomly distributed I found that the simple if(array[i] < 0) used in the first implementation is actually significantly faster.

这是我用来进行基准测试的代码:

Here's the code I used to benchmark:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <stdint.h>
#ifdef _OPT3
#include <emmintrin.h>
#include <tmmintrin.h>
#endif

int main(int argc, char **argv)
{
        int *array;
        struct timespec tsstart, tsend;
        int ncount = 500000000;
        int i;

        array = malloc(sizeof(int) * ncount);

        for(i = 0; i < ncount; i++)
        {
                array[i] = rand();
#ifdef _DIST
                if(rand() % 100 == 0) // make the values less likely to be negative.
#else
                if(rand() % 2 == 0) // the values are equeally likely to be negaitve as positive.
#endif
                        array[i] = -rand();
        }

        clock_gettime(CLOCK_MONOTONIC, &tsstart);

#ifdef _OPT1
        for(i = 0; i < ncount; i++)
        {
                uint32_t ntemp = array[i] >> 31;
                array[i] ^= ntemp;
                array[i] += ntemp & 1;
        }
#elif _OPT2
        for(ncount--; ncount >= 0; ncount--)
        {
                uint32_t ntemp = array[ncount] >> 31;
                array[ncount] ^= ntemp;
                array[ncount] += ntemp & 1;
        }
#elif _OPT3
        for(i = 0; i < ncount; i+=4)
        {
                __m128i a3_a2_a1_a0 = _mm_loadu_si128((__m128i*)&array[i]);         //Load 4 int32 elements from array.
                a3_a2_a1_a0 = _mm_abs_epi32(a3_a2_a1_a0);                           //Set absolute of 4 int32 elements in single instruction.
                _mm_storeu_si128((__m128i*)(&array[i]), a3_a2_a1_a0);               //Store 4 int32 elements of array.
        }
#elif _OPT4
        for(i = 0; i < ncount; i++)
        {
                array[i] = abs(array[i]); // abs() is actually an intrinsic on gcc and msvc
        }
#else
        for(i = 0; i < ncount; i++)
        {
                if(array[i] < 0)
                {
                        array[i] = -array[i];
                }
        }
#endif

        clock_gettime(CLOCK_MONOTONIC, &tsend);

        printf("start: %ld.%09ld\n", tsstart.tv_sec, tsstart.tv_nsec);
        printf("end: %ld.%09ld\n", tsend.tv_sec, tsend.tv_nsec);

        tsend.tv_sec -= tsstart.tv_sec;
        tsend.tv_nsec -= tsstart.tv_nsec;
        if(tsend.tv_nsec < 0)
        {
                tsend.tv_sec--;
                tsend.tv_nsec = 1000000000 + tsend.tv_nsec;
        }
        printf("diff: %ld.%09ld\n", tsend.tv_sec, tsend.tv_nsec);

        free(array);

        return 0;
}

测试结果

这是我的结果(时间以秒为单位).这些测试在 Intel(R) Xeon(R) CPU W3580 @ 3.33GHz 上运行.gcc (Debian 4.9.2-10) 4.9.2

Test Results

Here's my results (the times are in seconds). These tests were run on Intel(R) Xeon(R) CPU W3580 @ 3.33GHz. gcc (Debian 4.9.2-10) 4.9.2

// Implimentation One (No Optimizations)
$ gcc -O3 -march=native test.c
$ ./a.out
start: 9221396.418007954
end: 9221398.103490309
diff: 1.685482355

// Implimentation One Non Random Distrubution
$ gcc -D_DIST -O3 -march=native test.c
$ ./a.out
start: 9221515.889463124
end: 9221516.255742919
diff: 0.366279795

// Implementation Two (Branchless)
$ gcc -D_OPT1 -O3 -march=native test.c
$ ./a.out
start: 9221472.539690988
end: 9221472.787347636
diff: 0.247656648

// Implementation Three (Branchless Decrement)
$ gcc -D_OPT2 -O3 -march=native test.c
$ ./a.out
start: 9221930.068693139
end: 9221930.334575475
diff: 0.265882336

// Rotem's Implementation (SIMD)
$ gcc -D_OPT3 -O3 -march=native test.c
$ ./a.out
start: 9222076.001094679
end: 9222076.230432423
diff: 0.229337744

// Inuitive abs() Implementation
$ gcc -D_OPT4 -O3 -march=native test.c
$ ./a.out
start: 9222112.523690484
end: 9222112.754820240
diff: 0.231129756
// Inuitive abs() Implementation Without native
$ gcc -D_OPT4 -O3 test.c
$ ./a.out
start: 9223301.744006196
end: 9223301.974097927
diff: 0.230091731

结论

我从中得出的结论是，处理分支预测的硬件优化可能会比任何基于软件的优化显着加快代码执行速度并提高您的速度.通过尝试优化分支，您已经创建了无论处理什么数据都执行相同步骤的代码.因此，虽然它在恒定时间内执行，但如果数据不是完美随机分布的，您实际上可能会降低执行速度.

更新:我在打开编译器优化的情况下做了一些测试，发现不同的结果并不完全支持我之前得出的结论.

Update: I did some tests with compiler optimizations turned on and found different results that don't entirely support the conclusions I came to earlier.

根据我的经验，我发现如果你可以简单地编写更少的代码，那通常是最好的优化方式.似乎指令越少，无论硬件特性如何，它执行得越快.

In my experience I have found that if you can simply write less code, that is normally the best way of optimization. It seems the less instructions, the faster it executes regardless of hardware features.

我期待阅读有关此练习的任何评论.

I look forward to reading any comments on this exercise.

我已经添加了 Rotem 的实现结果.这段代码非常快，并证明您拥有的指令越少，执行时间就越快.Rotem 干得好！

I've added the results of Rotem's implementation. This code is super fast and proves that the less instructions you have, the faster the execution time. Nice work Rotem!

我今天做了一些广泛的测试，发现当像 gcc -O3 这样的编译器优化被打开时，像改变 for 循环计数方式这样的微优化完全没有影响.编译器最终生成对数组指针进行指针比较的程序集，以测试我们是否已到达末尾.

I did some extensive testing today and found that micro optimizations like changing the way the for loop counts has absolutely no effect when compiler optimizations like gcc -O3 are turned on. The compiler ends up generating assembly that does pointer comparison on the array pointer to test if we've reached the end.

优化 Rotem 提供的 SSE 代码在编译器使用 gcc -O3 运行时也没有任何区别，因为它在 16 字节边界上正确对齐内存，从而删除了 _mm_loadu_si128()/_mm_storeu_si128() 必要性.

Optimizing the SSE code provided by Rotem also makes no difference when the compiler is run with gcc -O3 as it properly aligns the memory on a 16 Byte boundary which removes the _mm_loadu_si128()/_mm_storeu_si128() necessity.

我添加了另一个使用简单直观的 abs() 函数的实现.事实证明，gcc 和 MSVC 上的 abs() 实际上是编译器内在的.我只是使用 gcc -O3 优化重做了所有测试结果.

I added another implementation which is using the simple and intuitive abs() function. It turns out abs() on gcc and MSVC is actually a compiler intrinsic. I redid all the test results simply using gcc -O3 optimizations.

如您所见，Rotem 的 SIMD 实现和 abs() 实现是最快的，其次是两个 XOR 实现，最后是分支实现.

As you can see the Rotem's SIMD implementaiton and the abs() implementation are the fastest, followed by the two XOR implementations, and finally the branching implementations.

在两种 XOR 实现中，递减 for 循环的实现实际上稍微慢一些，因为它的循环包含 14 条指令，而递增循环仅包含 13 条指令.

Of the two XOR implementations the one that decrements the for loop is actually slightly slower as it's loop contains 14 instructions whereas the increment loop contains only 13.

Rotem 的 SIMD 实现和 abs() 实现实际上都依赖于 PABSD 指令，并且都有包含 7 条指令的循环.然而，速度上的细微差别(SIMD 稍快)来自这样一个事实，即优化的 SIMD 实现假定内存将始终包含 4 个整数的倍数(128 位)，而 abs() 实现需要额外的指令测试内存不包含4个整数的倍数的情况.

Rotem's SIMD implementation and the abs() implementation actually both rely on the PABSD instruction and both have loops containing 7 instructions. The slight difference in speed (SIMD being slightly faster) however comes from the fact that the optimized SIMD implementation assumes the memory will always contain multiples of 4 integers (128bits) whereas the abs() implementation requires extra instructions to test the cases where the memory does not contain multiples of 4 integers.

这里令人惊奇的是，通过简单地使用 abs()，我们可以通过调用 C 库函数的简单性实现与 SIMD 几乎完全相同的速度.不使用 -march=native 的 abs() 循环仅长 4 条指令，它使用 PSRAD 而不是使用 PABSD、PXOR 和 PSUBD 指令.


What's amazing here is that by simply using abs() we can achieve almost exactly the same speed as SIMD with the simplicity of calling a C library function. The abs() loop without using -march=native is only 4 instructions longer, which instead of utilizing PABSD, it uses PSRAD, PXOR, and PSUBD instructions.
事实证明，可移植(或非本地)abs() 程序集几乎与 XOR 实现完全相同.
It turns out the portable (or non-native) abs() assembly is nearly exactly that of the XOR implementation.
这是abs():
psrad   $31, %xmm0
pxor    %xmm0, %xmm1
psubd   %xmm0, %xmm1

这是异或:
psrad   $31, %xmm1
movdqa  %xmm1, %xmm2
pxor    %xmm1, %xmm0
pand    %xmm3, %xmm2
paddd   %xmm2, %xmm0

现在让我们将这些转换回 C 代码:
Now lets convert these back into C code:
这是abs():
int ntemp = array[i] >> 31;
array[i] ^= ntemp;
array[i] -= ntemp;

这是异或:
uint32_t ntemp = array[i] >> 31;
array[i] ^= ntemp;
array[i] += ntemp & 1;

不同之处在于我们在原始 XOR 实现中有一个额外的按位与运算.
The difference is in the fact that we have an extra bitwise AND operation in the original XOR implementation.
使用abs()！:D

                        这篇关于在C中有效地取整数向量的绝对值的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

在C中有效地取整数向量的绝对值 [英] Efficiently taking the absolute value of an integer vector in C

问题描述

推荐答案

测试结果

Test Results

结论

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在C中有效地取整数向量的绝对值 [英] Efficiently taking the absolute value of an integer vector in C

问题描述

推荐答案

测试结果

Test Results

结论

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭