Where do SSE instructions outperform normal instructions?


Question

Where do x86-64's SSE instructions (vector instructions) outperform normal instructions? What I'm seeing is that the frequent loads and stores required to execute SSE instructions nullify any gain from the vector calculation. Could someone give me an example of SSE code that performs better than the normal code?

Maybe it's because I am passing each parameter separately, like this:

__m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
__m128i b = _mm_set_epi32(pb[0], pb[1], pb[2], pb[3]);
__m128i res = _mm_add_epi32(a, b);

for( i = 0; i < 4; i++ )
 po[i] = res.m128i_i32[i];

Isn't there a way to pass all 4 integers in one go, that is, pass the whole 128 bits of pa at once? And assign res.m128i_i32 to po in one go?

Recommended answer

Summarizing the comments into an answer:

You have fallen into the same trap that catches most first-timers. There are two problems in your example:


  1. You are misusing _mm_set_epi32().

  2. You have a very low computation/load-store ratio (1 to 3 in your example: one add for two loads and one store).

_mm_set_epi32() is a very expensive intrinsic. Although it's convenient to use, it doesn't compile to a single instruction. Some compilers (such as VS2010) can generate very poorly performing code when using _mm_set_epi32().

Instead, since you are loading contiguous blocks of memory, you should use _mm_load_si128(). That requires the pointer to be aligned to 16 bytes. If you can't guarantee this alignment, you can use _mm_loadu_si128(), but with a performance penalty. Ideally, you should properly align your data so that you don't need to resort to _mm_loadu_si128().
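As a rough illustration of the point above, here is a minimal sketch of the question's add rewritten with whole-register loads and stores. The helper name add_i32_sse and its parameter list are hypothetical, not from the original post. It assumes nothing about alignment, so it uses the unaligned _mm_loadu_si128() / _mm_storeu_si128(); if all three pointers are known to be 16-byte aligned, _mm_load_si128() / _mm_store_si128() could be substituted.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Element-wise po[i] = pa[i] + pb[i].
   add_i32_sse is a hypothetical helper name used only for illustration. */
void add_i32_sse(const int32_t *pa, const int32_t *pb, int32_t *po, size_t n)
{
    assert(pa != NULL && pb != NULL && po != NULL);

    size_t i = 0;
    /* Four 32-bit integers per iteration: two loads, one add, one store. */
    for (; i + 4 <= n; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(pb + i));
        _mm_storeu_si128((__m128i *)(po + i), _mm_add_epi32(a, b));
    }
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        po[i] = pa[i] + pb[i];
}
```

Unlike _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]), which places its first argument in the highest element, a load keeps the elements in memory order, so the store writes po[0] = pa[0] + pb[0] as expected.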

To be truly efficient with SSE, you'll also want to maximize your computation/load-store ratio. A target that I shoot for is 3 - 4 arithmetic instructions per memory access. This is a fairly high ratio. Typically you have to refactor the code or redesign the algorithm to increase it. Combining passes over the data is a common approach.
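To make the ratio concrete, here is a hedged sketch of a single pass that does several arithmetic operations per memory access. The workload (a cubic polynomial evaluated with Horner's rule) and the name poly3_sse are assumptions chosen for illustration and are not from the original answer. Each iteration performs one load, six arithmetic instructions, and one store, which is about 3 arithmetic instructions per memory access.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE single-precision intrinsics */

/* y[i] = ((c3*x[i] + c2)*x[i] + c1)*x[i] + c0, computed in one pass.
   poly3_sse is a hypothetical name used only for illustration. */
void poly3_sse(const float *x, float *y, size_t n,
               float c0, float c1, float c2, float c3)
{
    assert(x != NULL && y != NULL);

    /* Broadcast each coefficient once, outside the loop. */
    const __m128 k0 = _mm_set1_ps(c0);
    const __m128 k1 = _mm_set1_ps(c1);
    const __m128 k2 = _mm_set1_ps(c2);
    const __m128 k3 = _mm_set1_ps(c3);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);               /* 1 load     */
        __m128 r = _mm_add_ps(_mm_mul_ps(k3, v), k2); /* 2 arith    */
        r = _mm_add_ps(_mm_mul_ps(r, v), k1);         /* 2 arith    */
        r = _mm_add_ps(_mm_mul_ps(r, v), k0);         /* 2 arith    */
        _mm_storeu_ps(y + i, r);                      /* 1 store    */
    }
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        y[i] = ((c3 * x[i] + c2) * x[i] + c1) * x[i] + c0;
}
```

Written as three separate multiply-add passes over the array, the same computation would load and store every element three times; fusing the passes keeps the intermediate values in registers.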

Loop unrolling is often necessary to maximize performance when you have large loop bodies with long dependency chains.
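One common form of this is unrolling a reduction with several independent accumulators, so that consecutive adds do not wait on each other. The sketch below is an assumption-laden illustration, not code from the original answer: the name sum_sse_unrolled and the unroll factor of 4 are arbitrary choices, and the best factor depends on the instruction latency and throughput of the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE single-precision intrinsics */

/* Sums n floats using four independent vector accumulators.
   sum_sse_unrolled is a hypothetical name used only for illustration. */
float sum_sse_unrolled(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();

    size_t i = 0;
    /* 16 floats per iteration; the four adds form separate dependency chains. */
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(x + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(x + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(x + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(x + i + 12));
    }

    /* Combine the accumulators, then add the four lanes together. */
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the remaining 0..15 elements. */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```

Because floating-point addition is not associative, this version can round differently from a plain left-to-right scalar sum.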

Some examples of SO questions that successfully use SSE to achieve a speedup:

  • C code loop performance (non-vectorized)
  • C code loop performance [continued] (vectorized)
  • how to achieve 4 flops per cycle (contrived example for achieving peak processor performance)
