Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?


Question

I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4 float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running on an i7-8565U processor (whiskey lake) and compiling with: g++ main.cpp -mavx -o test.exe to enable the AVX extension instructions in GCC.

The weird thing is that the SIMD version takes about 1.4 seconds, and the non-SIMD version takes only 1 second. I feel as though I'm doing something wrong, as I thought the SIMD version should run 4 times faster. Any help is appreciated; the code is below. I've placed the non-SIMD code in comments; the code in its current form is the SIMD version.

#include "immintrin.h" // for AVX 
#include <iostream>

struct NonSIMDVec {
    float x, y, z, w;
};

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b);

int main() {
    union { __m128 result; float res[4]; };
    // union { NonSIMDVec result; float res[4]; };

    float total = 0; 
    for(unsigned i = 0; i < 100000000; ++i) {
        __m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
        __m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
        // NonSIMDVec a4 = {0.0000002f, 1.23f, 2.0f, (float)i}; 
        // NonSIMDVec b4 = {(float)i, 1.3f, 2.0f, 0.000001f};

        result = _mm_mul_ps(a4, b4); 
        // result = multiplyTwo(a4, b4);

        total += res[0];
        total += res[1];
        total += res[2];
        total += res[3];
    }

    std::cout << total << '\n';
}

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b)
{ return {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w}; }

Answer

With optimization disabled (the gcc default is -O0), intrinsics are often terrible. Anti-optimized -O0 code-gen for intrinsics usually hurts a lot (even more than for scalar), and some of the function-like intrinsics introduce extra store/reload overhead. Plus the extra store-forwarding latency of -O0 tends to hurt more because there's less ILP when you do things with 1 vector instead of 4 scalars.

Use gcc -march=native -O3
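Concretely, using the file and output names from the question, the build line would become something like:

```shell
# Build with optimization enabled; -march=native lets GCC use every ISA
# extension the host CPU supports (AVX, FMA, etc. on this Whiskey Lake i7),
# which makes the question's separate -mavx flag redundant.
g++ -O3 -march=native main.cpp -o test.exe
./test.exe
```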

But even with optimization enabled, your code is still written to destroy the performance of SIMD by doing a horizontal add of each vector inside the loop. See How to Calculate Vector Dot Product Using SSE Intrinsic Functions in C for how to not do that: use _mm_add_ps to accumulate a __m128 total vector, and only horizontal sum it outside the loop.

You bottleneck your loop on FP-add latency by doing scalar total += inside the loop. That loop-carried dependency chain means your loop can't run any faster than 1 float per 4 cycles on your Skylake-derived microarchitecture where addss latency is 4 cycles. (https://agner.org/optimize/)

Even better than __m128 total, use 4 or 8 vectors to hide FP add latency, so your SIMD loop can bottleneck on mul/add (or FMA) throughput instead of latency.

Once you fix that, then as @harold points out the way you're using _mm_set_ps inside the loop will result in pretty bad asm from the compiler. It's not a good choice inside a loop when the operands aren't constants, or at least loop-invariant.

Your example here is clearly artificial; normally you'd be loading SIMD vectors from memory. But if you did need to update a loop counter in a __m128 vector, you might use tmp = _mm_add_ps(tmp, _mm_set_ps(1.0, 0, 0, 0)). Or unroll with adding 1.0, 2.0, 3.0, and 4.0 so the loop-carried dependency is only the += 4.0 in the one element.

x + 0.0 is the identity operation even for FP (except maybe with signed zero) so you can do it to the other elements without changing them.

Or for the low element of a vector, you can use _mm_add_ss (scalar) to only modify it.
