Fast divisibility tests (by 2, 3, 4, 5, ..., 16)?

Problem description

What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how can one determine very quickly whether a number is divisible by 2, 3, 4, 5, ... up to 16?

WARNING: the code below is an EXAMPLE only. Every line is independent! The obvious solution using the modulo operation is slow on many processors that have no DIV hardware (like many ARMs). Some compilers also cannot make such optimizations (say, if the divisor is a function argument or depends on something else).

Divisible_by_1 = do();
Divisible_by_2 = if (!(number & 1)) do();
Divisible_by_3 = ?
Divisible_by_4 = ?
Divisible_by_5 = ?
Divisible_by_6 = ?
Divisible_by_7 = ?
Divisible_by_8 = ?
Divisible_by_9 = ?
Divisible_by_10 = ?
Divisible_by_11 = ?
Divisible_by_12 = ?
Divisible_by_13 = ?
Divisible_by_14 = ?
Divisible_by_15 = ?
Divisible_by_16 = if (!(number & 0x0000000F)) do();

and special cases:

Divisible_by_2k = if (!(number & (tk-1))) do();  //tk=2**k=(2*2*2*...) k times
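
A number is divisible by 2**k exactly when its low k bits are all zero. A minimal C sketch of that special case follows; the helper name divisible_by_pow2 is hypothetical, and the sketch assumes a 32-bit signed input as in the question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when n is divisible by 2**k, for 0 <= k <= 31. */
static bool divisible_by_pow2(int32_t n, unsigned k)
{
    assert(k < 32);                           /* a shift by 32 or more would be undefined */
    uint32_t mask = (UINT32_C(1) << k) - 1u;  /* the low k bits set */
    /* Converting to uint32_t reduces n modulo 2**32, which preserves
       divisibility by any power of two up to 2**31, negatives included. */
    return ((uint32_t)n & mask) == 0u;
}
```

For example, divisible_by_pow2(48, 4) is true because 48 = 3 * 16, while divisible_by_pow2(40, 4) is false.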

Solution

It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.
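
As an illustration of what a division-free test can look like, here is a sketch of the well-known multiplicative-inverse divisibility test. It is not taken from this answer, and it is not necessarily what any particular compiler emits for "% c". The names divtest, make_divtest and is_divisible are hypothetical. The idea: write d as an odd number times 2**k; n is divisible by d when its low k bits are zero and n times the inverse of the odd part (modulo 2**32) does not exceed UINT32_MAX divided by the odd part. The constants are computed once per divisor, so the test itself needs only an AND, a multiply and a compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Precomputed constants for testing divisibility by d = odd * 2**k. */
struct divtest {
    uint32_t inv;    /* inverse of the odd part modulo 2**32 */
    uint32_t limit;  /* UINT32_MAX / odd part */
    uint32_t mask;   /* 2**k - 1 */
};

/* Done once per divisor; this is the only place a real division occurs. */
static struct divtest make_divtest(uint32_t d)
{
    assert(d != 0);
    struct divtest t;
    unsigned k = 0;
    while ((d & 1u) == 0u) {       /* strip factors of two: d = odd * 2**k */
        d >>= 1;
        k++;
    }
    t.mask = (UINT32_C(1) << k) - 1u;
    uint32_t x = d;                /* odd d satisfies d*d == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 4; i++)    /* Newton steps: 3 -> 6 -> 12 -> 24 -> 48 bits */
        x *= 2u - d * x;
    t.inv = x;
    t.limit = UINT32_MAX / d;
    return t;
}

/* The test itself: no division instruction. */
static bool is_divisible(int32_t n, const struct divtest *t)
{
    /* Work on the magnitude; the unsigned negation is also safe for INT32_MIN. */
    uint32_t m = n < 0 ? 0u - (uint32_t)n : (uint32_t)n;
    return (m & t->mask) == 0u && m * t->inv <= t->limit;
}
```

A caller would build one struct divtest per divisor (for instance for each of 3, 5, 6, 7, ...) outside the hot loop and call is_divisible inside it. When the divisor is a literal constant, writing n % c and letting the compiler choose remains the simpler option.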

In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit multiplications to divisions is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. For 64-bit the respective ratios are 4:92 (23x) and 5:77 (15.4x).

The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallel. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit multiplication every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.

Taking the K10 as an example: during the 45 cycles a 32-bit division requires, 45 multiplications can be issued, and the next-to-last and last of these complete one and two clock cycles after the division does. A LOT of work can be performed in 45 multiplications.
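
The cycle arithmetic can be checked with a deliberately simplified model. The function name mul_completion_cycle is hypothetical, and the model assumes independent multiplications issued one per cycle starting at cycle 0, each with a fixed latency; real pipelines are more complicated.

```c
#include <assert.h>

/* Cycle at which the i-th multiply (0-based) completes, assuming one
   multiply is issued per cycle from cycle 0 and each takes `latency` cycles. */
static int mul_completion_cycle(int i, int latency)
{
    assert(i >= 0 && latency > 0);
    return i + latency;
}
```

With the K10 figures quoted above (multiply latency 3, division 45 cycles), the 45th multiply has index 44 and completes at cycle 47, two cycles after the division; the 44th completes at cycle 46, one cycle after it.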

It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 clock cycles for 32-bit and from 71 to 77 for 64-bit.

Granlund's pages at gmplib.org and at the Royal Institute of Technology in Stockholm (www.nada.kth.se) contain more goodies, some of which have been incorporated into the gcc compiler.
