GCC 的 sqrt() 编译后如何工作?使用哪种root方法?牛顿-拉夫森? [英] How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?

查看:14
本文介绍了GCC 的 sqrt() 编译后如何工作?使用哪种root方法?牛顿-拉夫森?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

只是对 GCC 上 math.h 的标准 sqrt() 感到好奇.我使用 Newton-Raphson 编写了我自己的 sqrt() 来做到这一点!

Just curiosity about the standard sqrt() from math.h on GCC works. I coded my own sqrt() using Newton-Raphson to do it!

推荐答案

是的,我知道 fsqrt.但是 CPU 是怎么做的呢?我无法调试硬件

yeah, I know fsqrt. But how the CPU does it? I can't debug hardware

现代 CPU 中的典型 div/sqrt 硬件使用 2 基数的幂一次计算多个结果位.例如http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf 提供了 Radix-16 div/sqrt ALU 设计的详细信息,并将其与 Penryn 中的设计进行了比较.(他们声称延迟更低,功耗更低.)我看了看图片;看起来一般的想法是做一些事情并通过乘法器和加法器迭代地反馈结果,基本上就像长除法一样.而且我认为类似于您在软件中进行一次位划分的方式.

Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.

英特尔 Broadwell 推出了 Radix-1024 div/sqrt 单元.关于 RWT 的讨论 询问 Penryn (Radix-16) 和布罗德韦尔.例如加宽 SIMD 向量除法器,因此 256 位除法比 128 位除法更慢,并增加基数.

Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.

也许还可以看到

  • The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers was replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
  • https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
  • https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations

但是无论硬件如何工作,IEEE 要求 sqrt(和 mul/div/add/sub)给出正确舍入的结果,即错误 <= 0.5 ulp,所以你不需要知道它是如何工作的,只需要知道性能.这些操作是特殊的,其他函数如logsin没有有这个要求,而真正的库实现通常没有那么准确.(还有 x87 fsin 对于接近 Pi/2 的输入来说绝对不是那么准确,因为范围减少中的灾难性抵消会导致潜在的巨大相对误差.)

But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)

请参阅 https://agner.org/optimize/ 了解 x86 指令表,包括吞吐量和延迟标量和 SIMD sqrtsd/sqrtss 及其更广泛的版本.我在 浮点除法与浮点乘法

See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication

对于非 x86 硬件 sqrt,您必须查看其他供应商发布的数据,或测试人员的结果.

For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.

与大多数指令不同,sqrt 性能通常取决于数据.(通常更高的有效位或更大的结果需要更长的时间).

Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).

这篇关于GCC 的 sqrt() 编译后如何工作?使用哪种root方法?牛顿-拉夫森?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆