编译后GCC的sqrt()如何工作?使用哪种根方法?牛顿-拉夫森? [英] How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?

查看:126
本文介绍了编译后GCC的sqrt()如何工作?使用哪种根方法?牛顿-拉夫森?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对来自GCC上math.h的标准sqrt()的好奇心很好.我使用牛顿-拉夫森(Newton-Raphson)编写了自己的sqrt()代码!

Just curiosity about the standard sqrt() from math.h on GCC works. I coded my own sqrt() using Newton-Raphson to do it!

推荐答案

是的,我知道fsqrt.但是,CPU是如何做到的呢?我无法调试硬件

yeah, I know fsqrt. But how the CPU does it? I can't debug hardware

现代CPU中典型的div/sqrt硬件使用2基的幂一次计算多个结果位.例如 http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf 提供了Radix-16 div/sqrt ALU设计的详细信息,并将其与Penryn中的设计进行了比较. (他们声称延迟时间更短,功耗更低.)我看了看照片.看起来一般的想法是做一些事情,并通过乘法器和加法器迭代地反馈结果,基本上就像长除法一样.而且我认为与您在软件中进行一次位分割的方式类似.

Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.

英特尔Broadwell推出了Radix-1024 div/sqrt单元. 有关RWT的讨论询问了Penryn(Radix-16)之间的变化)和Broadwell.例如扩宽SIMD向量分频器,使256位除速度比128位减慢,并且基数增加.

Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.

也许也看到

  • The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers was replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
  • https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
  • https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations

但是无论硬件如何工作, IEEE要求sqrt(和mul/div/add/sub)给出正确舍入的结果,即错误< = 0.5 ulp,因此您无需了解其工作原理,只需了解性能.这些操作是特殊的,其他功能,例如logsin却没有 的要求,并且实际的库实现通常不那么准确. (并且 x87 对于Pi/2附近的输入绝对不是那么精确,因为在范围缩小中发生灾难性的取消会导致潜在的巨大相对误差.

But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)

有关x86指令表(包括吞吐量和延迟)的信息,请参见 https://agner.org/optimize/.标量和SIMD sqrtsd/sqrtss及其更宽的版本.我在浮点除法与浮点乘法

See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication

对于非x86硬件sqrt,您必须查看其他供应商发布的数据或经过测试的人的结果.

For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.

与大多数指令不同,sqrt性能通常取决于数据. (通常,更高的有效位或更大的结果耗时更长.)

Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).

这篇关于编译后GCC的sqrt()如何工作?使用哪种根方法?牛顿-拉夫森?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆