Using SIMD on amd64, when is it better to use more instructions vs. loading from memory?

Question

I have some highly performance-sensitive code. A SIMD implementation using SSEn and AVX uses about 30 instructions, while a version that uses a 4096-byte lookup table uses about 8 instructions. In a microbenchmark, the lookup table is faster by 40%. If I microbenchmark while trying to invalidate the cache every 100 iterations, they appear about the same. In my real program, the non-loading version appears to be faster, but it's really hard to get a provably good measurement, and I've had measurements go both ways.

I'm just wondering if there are some good ways to think about which one would be better to use, or standard benchmarking techniques for this type of decision.

Answer

Look-up tables are rarely a performance win in real-world code, especially when they're as large as 4k bytes. Modern processors can execute computations so quickly that it is almost always faster to just do the computations as needed, rather than trying to cache them in a look-up table. The only exception to this is when the computations are prohibitively expensive. That's clearly not the case here, when you're talking about a difference of 30 vs. 8 instructions.

The reason your micro-benchmark is suggesting that the LUT-based approach is faster is because the entire LUT is getting loaded into cache and never evicted. This makes its usage effectively free, such that you are comparing between executing 8 and 30 instructions. Well, you can guess which one will be faster. :-) In fact, you did guess this, and proved it with explicit cache invalidation.

In real-world code, unless you're dealing with a very short, tight loop, the LUT will inevitably be evicted from the cache (especially if it's as large as this one is, or if you execute a lot of code in between calls to the code being optimized), and you'll pay the penalty of re-loading it. You don't appear to have enough operations that need to be performed concurrently such that this penalty can be mitigated with speculative loads.

The other hidden cost of (large) LUTs is that they risk evicting code from the cache, since modern processors, while they typically split the L1 into separate data and instruction caches, share a unified L2 and L3 between the two. Thus, even if the LUT-based implementation is slightly faster, it runs a very strong risk of slowing everything else down. A microbenchmark won't show this. (But actually benchmarking your real code will, so that's always a good thing to do when feasible. If not, read on.)

My rule of thumb is, if the LUT-based approach is not a clear performance win over the other approach in real-world benchmarks, I don't use it. It sounds like that is the case here. If the benchmark results are too close to call, it doesn't matter, so pick the implementation that doesn't bloat your code by 4k.
