Why is my SSE code slower than native C++ code?


Question

First of all, I am new to SSE. I decided to accelerate my code with it, but it seems to run slower than my native code.

This is an example that calculates the sum of squares. On my Intel i7-6700HQ, the native code takes 0.43 s and the SSE version takes 0.52 s. So where is the bottleneck?

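#include <chrono>
#include <iostream>
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, ...

using namespace std::chrono;

// Scalar reference: x^2 + y^2 for one pair of values.
inline float squared_sum(const float x, const float y)
{
    return x * x + y * y;
}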

#define USE_SIMD

void calculations()
{
    high_resolution_clock::time_point t1, t2;

    int result_v = 0;

    t1 = high_resolution_clock::now();

    alignas(16) float data_x[4];
    alignas(16) float data_y[4];
    alignas(16) float result[4];
    __m128 v_x, v_y, v_res;
    for (int y = 0; y < 5120; y++)
    {
        data_y[0] = y;
        data_y[1] = y + 1;
        data_y[2] = y + 2;
        data_y[3] = y + 3;
        for (int x = 0; x < 5120; x++)
        {
            data_x[0] = x;
            data_x[1] = x + 1;
            data_x[2] = x + 2;
            data_x[3] = x + 3;
#ifdef USE_SIMD
            v_x = _mm_load_ps(data_x);
            v_y = _mm_load_ps(data_y);
            v_x = _mm_mul_ps(v_x, v_x);
            v_y = _mm_mul_ps(v_y, v_y);
            v_res = _mm_add_ps(v_x, v_y);
            _mm_store_ps(result, v_res);
#else
            result[0] = squared_sum(data_x[0], data_y[0]);
            result[1] = squared_sum(data_x[1], data_y[1]);
            result[2] = squared_sum(data_x[2], data_y[2]);
            result[3] = squared_sum(data_x[3], data_y[3]);
#endif

            result_v += (int)(result[0] + result[1] + result[2] + result[3]);
        }
    }

    t2 = high_resolution_clock::now();
    duration<double> time_span1 = duration_cast<duration<double>>(t2 - t1);
    std::cout << "Exec time:\t" << time_span1.count() << " s\n";
}
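
One thing visible in the loop above is that every iteration converts four integers to float, writes them to memory, and only then loads them into an SSE register. Whether this explains the measured difference is not confirmed by the answer below, so the following is only a sketch of an alternative under that assumption: it keeps the lane values in registers and accumulates in a vector. The function name `sum_squares_in_registers` is hypothetical, and the result is accumulated in float, so it loses precision on large grids.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: for every (x, y) in a rows x cols grid, adds
// (x+k)^2 + (y+k)^2 for k = 0..3, keeping all lane values in registers.
float sum_squares_in_registers(int rows, int cols)
{
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 acc = _mm_setzero_ps();

    // Lanes hold y, y+1, y+2, y+3 (starting at y = 0).
    __m128 v_y = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    for (int y = 0; y < rows; ++y)
    {
        const __m128 y2 = _mm_mul_ps(v_y, v_y); // computed once per row

        // Lanes hold x, x+1, x+2, x+3 (starting at x = 0).
        __m128 v_x = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
        for (int x = 0; x < cols; ++x)
        {
            acc = _mm_add_ps(acc, _mm_add_ps(_mm_mul_ps(v_x, v_x), y2));
            v_x = _mm_add_ps(v_x, one); // advance x without touching memory
        }
        v_y = _mm_add_ps(v_y, one);
    }

    // Single store and horizontal sum after the loops.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The design choice here is to move the store and the horizontal sum out of the inner loop, so the loop body contains only vector arithmetic.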

UPDATE: fixed the code according to the comments.

I am using Visual Studio 2017, compiling for x64.

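  • Optimization: Maximum Optimization (Favor Speed) (/O2)
  • Inline Function Expansion: Any Suitable (/Ob2)
  • Favor Size or Speed: Favor fast code (/Ot)
  • Omit Frame Pointers: Yes (/Oy)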

Compilers already generate optimized code, so nowadays it is hard to speed it up much further by hand. One thing you can still do to accelerate the code is parallelization.
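
As a rough illustration of that idea, here is a minimal sketch that splits the rows of the grid across threads and combines per-thread partial sums at the end. It is a simplified integer version of the sum of squares, not the 4-wide float code from the question, and the names `partial_sum` and `parallel_sum` are hypothetical.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Sum of x*x + y*y over rows [y_begin, y_end) and columns [0, cols).
std::int64_t partial_sum(int y_begin, int y_end, int cols)
{
    std::int64_t sum = 0;
    for (int y = y_begin; y < y_end; ++y)
        for (int x = 0; x < cols; ++x)
            sum += static_cast<std::int64_t>(x) * x
                 + static_cast<std::int64_t>(y) * y;
    return sum;
}

// Splits the rows into num_threads contiguous chunks, one per thread.
std::int64_t parallel_sum(int rows, int cols, unsigned num_threads)
{
    if (num_threads == 0)
        num_threads = 1;

    std::vector<std::int64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
    {
        const int begin = static_cast<int>(static_cast<std::int64_t>(rows) * t / num_threads);
        const int end = static_cast<int>(static_cast<std::int64_t>(rows) * (t + 1) / num_threads);
        // Each thread writes only its own slot, so no locking is needed.
        workers.emplace_back([&partial, t, begin, end, cols] {
            partial[t] = partial_sum(begin, end, cols);
        });
    }
    for (auto& w : workers)
        w.join();

    std::int64_t total = 0;
    for (std::int64_t p : partial)
        total += p;
    return total;
}
```

A call such as `parallel_sum(5120, 5120, std::thread::hardware_concurrency())` would cover the grid size used in the question. Whether it is faster depends on the machine and on thread start-up cost relative to the work per thread.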

Thanks for the answers. They are mostly the same, so I accepted Søren V. Poulsen's answer because it was the first.

Recommended answer

Modern compilers are incredible machines and will already use SIMD instructions where possible (given the correct compilation flags).
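
To make that concrete, here is a minimal sketch of the kind of plain loop that optimizing compilers can often vectorize on their own when optimizations are enabled (for example /O2 on MSVC). The function name `squared_sums` and its array-based signature are assumptions for illustration, not code from the question.

```cpp
#include <cstddef>

// Plain scalar loop with no intrinsics. Each iteration is independent,
// which is what typically lets the compiler emit SIMD instructions for it.
void squared_sums(const float* x, const float* y, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = x[i] * x[i] + y[i] * y[i];
}
```

Whether a given compiler actually vectorizes this depends on its version and flags, so checking the generated assembly remains the reliable way to find out.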

One general strategy for finding out what the compiler is doing is to look at the disassembly of your code. If you don't want to do that on your own machine, you can use an online service like Godbolt: https://gcc.godbolt.org/z/T6GooQ.

One tip is to avoid atomic for storing intermediate results, as you are doing here. Atomic values exist to ensure synchronization between threads, and that can come at a relatively high computational cost.
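
The difference between the two patterns can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and it assumes the shared result is a `std::atomic<long long>`, which is not shown in the question's current code.

```cpp
#include <atomic>

// Pattern to avoid in a hot loop: every iteration performs an
// atomic read-modify-write on the shared total.
void sum_squares_atomic(std::atomic<long long>& total, int n)
{
    for (int i = 0; i < n; ++i)
        total += static_cast<long long>(i) * i;
}

// Cheaper pattern: accumulate in a plain local variable and
// publish the result to the atomic once, after the loop.
void sum_squares_local(std::atomic<long long>& total, int n)
{
    long long local = 0;
    for (int i = 0; i < n; ++i)
        local += static_cast<long long>(i) * i;
    total.fetch_add(local, std::memory_order_relaxed);
}
```

Both functions add the same value to `total`. The second one touches the atomic only once, which also leaves the compiler free to optimize the loop itself.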
