How does GCC's sqrt() work after compilation? Which root-finding method is used? Newton-Raphson?
Question
Just curious about how the standard sqrt() from math.h on GCC works. I coded my own sqrt() using Newton-Raphson to do it!
Answer
Yeah, I know fsqrt. But how does the CPU do it? I can't debug hardware.
Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.
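The bit-at-a-time idea can be sketched in C as a radix-2 digit-recurrence integer square root: one result bit per iteration, with a trial subtraction fed back each step, much like long division. This is only an illustration of the recurrence (`isqrt32` is a name chosen here, not a real API); hardware radix-16 or radix-1024 units retire 4 or 10 result bits per step instead of 1.

```c
#include <stdint.h>

/* Radix-2 digit-recurrence square root: produces floor(sqrt(n)) one
 * result bit per iteration, analogous in spirit to the hardware loop. */
static uint32_t isqrt32(uint32_t n)
{
    uint32_t root = 0;   /* result bits accumulated so far */
    uint32_t rem  = 0;   /* running remainder, as in long division */

    for (int i = 15; i >= 0; i--) {                 /* 16 result bits */
        /* Bring down the next two input bits into the remainder. */
        rem = (rem << 2) | ((n >> (2 * i)) & 3u);

        /* Trial value: append a 1 bit to the current partial root. */
        uint32_t trial = (root << 2) | 1u;
        root <<= 1;

        if (rem >= trial) {     /* does the trial subtraction fit? */
            rem -= trial;
            root |= 1u;         /* yes: this result bit is 1 */
        }
    }
    return root;
}
```

A higher-radix unit does the same recurrence but chooses several result bits per step from a table of candidate digits, which is where the extra hardware (and the lower latency) comes from.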
Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.
Maybe also see:
- The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers were replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
- https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
- https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations
But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special; other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2, where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication.
For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.
Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer.)
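For reference, a Newton-Raphson software sqrt like the one the question describes might look like the sketch below (`nr_sqrt` is a hypothetical name; a real libm would instead emit the hardware instruction, or start from a much better initial guess). It converges to within an ulp or two but, unlike the hardware instruction, is not guaranteed correctly rounded.

```c
#include <math.h>

/* Newton-Raphson iteration for sqrt(n): x_{k+1} = (x_k + n/x_k) / 2.
 * Quadratic convergence once x is close; the crude guess x = n just
 * means a few extra early iterations for large n. */
static double nr_sqrt(double n)
{
    if (n < 0.0)  return NAN;
    if (n == 0.0) return 0.0;

    double x = n;                        /* crude initial guess */
    for (int i = 0; i < 60; i++) {       /* 60 is far more than enough */
        double next = 0.5 * (x + n / x);
        if (next == x)                   /* reached a fixed point */
            break;
        x = next;
    }
    return x;                            /* close, but not guaranteed
                                            correctly rounded */
}
```

Comparing nr_sqrt(x) against sqrt(x) over many inputs is an easy way to see the 0.5 ulp guarantee in action: the two occasionally differ in the last bit, and when they do, the library result is the correctly rounded one.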