How to compute sincos fast on an x64 CPU?

Question

This question is addressed to users experienced with the SSE/AVX instruction families, especially those familiar with performance analysis of such code. I have seen many different implementations and approaches, ranging from older SSE2 ones to newer ones, and the web is flooded with such links. Personally, though, I am not deeply experienced in analyzing SSE assembly. Some people point to uops and caches, which requires low-level knowledge. So I am asking for hints and your personal experience. If you have time to lay out a comparison of what is fastest and why, and which approaches you looked at, that would be great. The implementation need not be very precise: 10-16 bits of single-precision accuracy is good enough. More is better, as long as it does not affect speed.

P.S. To avoid a flood of meta comments, here is the task in detail:

  • Given a scalar argument x (in radians), passed in an xmm register (per the x64 fastcall convention).
  • Write a function with the signature __m128 sincos(float x); that returns approximations of sin(x) and cos(x).
  • The return value should be in a single xmm register and computed as fast as possible while satisfying the 10-bit precision requirement.
  • The argument can be any real number (but not NaN, inf, and so on). If the approach requires argument normalization, a performant implementation of it (fmod()) is also in scope. The question is not about handling special FP cases.
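
The task above can be sketched in C with SSE2 intrinsics. This is a minimal illustration, not any of the implementations referenced in this thread: the name sincos_sketch, the single-step reduction by pi, and the degree-7 Taylor polynomial are all assumptions chosen for brevity. It computes cos(x) as sin(x + pi/2) so that both results share one polynomial evaluation. The truncation error of the polynomial on [-pi/2, pi/2] is roughly 1.6e-4, which is inside the 10-bit target (about 9.8e-4) for moderate |x|; for large |x| the simple reduction loses accuracy, and x/pi must fit in a 32-bit integer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: returns sin(x) in lane 0 and cos(x) in lane 1.
   Lanes 2 and 3 are don't-care values (they duplicate sin(x)).
   Assumes the default round-to-nearest MXCSR mode and moderate |x|. */
static __m128 sincos_sketch(float x)
{
    /* Lane 0 holds x, lane 1 holds x + pi/2, since cos(x) = sin(x + pi/2). */
    __m128 a = _mm_add_ps(_mm_set1_ps(x),
                          _mm_setr_ps(0.0f, 1.57079632679f, 0.0f, 0.0f));

    /* Range reduction: k = round(a / pi), r = a - k*pi in [-pi/2, pi/2]. */
    __m128i k = _mm_cvtps_epi32(_mm_mul_ps(a, _mm_set1_ps(0.318309886184f)));
    __m128 kf = _mm_cvtepi32_ps(k);
    __m128 r  = _mm_sub_ps(a, _mm_mul_ps(kf, _mm_set1_ps(3.14159265359f)));

    /* Degree-7 Taylor polynomial for sin(r), evaluated in Horner form:
       r * (1 + r2 * (-1/6 + r2 * (1/120 + r2 * (-1/5040)))). */
    __m128 r2 = _mm_mul_ps(r, r);
    __m128 p  = _mm_set1_ps(-1.0f / 5040.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(-1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f));
    p = _mm_mul_ps(p, r);

    /* sin(a) = (-1)^k * sin(r): shift the parity bit of k into the sign bit. */
    __m128i sign = _mm_slli_epi32(k, 31);
    return _mm_xor_ps(p, _mm_castsi128_ps(sign));
}
```

The two results can be read back with _mm_storeu_ps into a float[4]. A tuned version would likely replace the Taylor coefficients with minimax ones (allowing a lower degree at the same accuracy) and use a multi-step reduction to stay accurate for large arguments.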

This may be a duplicate, but I failed to find a similar question here, so please point me to it if one already exists.

Solution

I have discovered a great modern revision of Julien Pommier's implementation, ported to AVX/AVX2 by Giovanni Garberoglio and released under the zlib license:

http://software-lisc.fbk.eu/avx_mathfun/

It is really fast: 80-90M iterations per second on a single core of an i7 3770K, with each iteration producing 8 sines and 8 cosines. For comparison, I get about 15M iterations per second when calling sinf() 8 times and cosf() 8 times per iteration (functions from the MSVC 2017 x64 library, with AVX compiler settings).


UPD: There is also the excellent FastTrigo code sample, whose FT::sincos() function is 20% faster than Julien Pommier's implementation. Its FT::sincos() provides exactly 10 bits of guaranteed accuracy.
