What is the latency and throughput of the RDRAND instruction on Ivy Bridge?

Question

I cannot find any information on agner.org about the latency or throughput of the RDRAND instruction (en.wikipedia.org/wiki/RdRand). However, this processor exists, so the information must be out there.

Edit: Actually, the newest optimization manual does mention this instruction. It is documented as taking <200 cycles, with a total bandwidth of at least 500 MB/s on Ivy Bridge. More in-depth statistics on this instruction would still be great, since its latency and throughput are variable.

Recommended answer

I wrote librdrand. It is a very basic set of routines that use the RdRand instruction to fill buffers with random numbers.

The performance data we showed at IDF came from test software I wrote that spawns a number of threads using pthreads on Linux. Each thread fills a memory buffer with random numbers using RdRand. The program measures the average speed and can iterate while varying the number of threads.
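A harness of the kind described above might look roughly like the following. This is a minimal sketch, not the IDF test software: it assumes x86-64 with GCC or Clang on Linux, and the names `have_rdrand`, `rdrand64`, `job_t`, `worker` and `measure_mb_per_s`, as well as the retry limit of 10, are hypothetical choices made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <cpuid.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* 1 if CPUID leaf 1 reports RDRAND support (ECX bit 30), else 0. */
static int have_rdrand(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((ecx >> 30) & 1u);
}

/* One RDRAND attempt; the carry flag says whether a value was delivered. */
static int rdrand64(uint64_t *out) {
    uint64_t v;
    unsigned char ok;
    __asm__ volatile("rdrand %0; setc %1" : "=r"(v), "=qm"(ok) : : "cc");
    *out = v;
    return ok;
}

typedef struct {
    uint64_t *buf;  /* per-thread buffer, meant to be small enough to stay in cache */
    size_t words;   /* buffer length in 64-bit words */
    size_t passes;  /* how many times the buffer is refilled */
    size_t stored;  /* words actually written */
} job_t;

static void *worker(void *arg) {
    job_t *job = arg;
    for (size_t p = 0; p < job->passes; p++) {
        for (size_t i = 0; i < job->words; i++) {
            uint64_t v;
            int tries = 10;                  /* hypothetical retry limit */
            while (!rdrand64(&v) && --tries > 0) { }
            if (tries == 0) return NULL;     /* DRNG kept failing; give up */
            job->buf[i] = v;                 /* store, so the result is used */
            job->stored++;
        }
    }
    return NULL;
}

/* Aggregate MB/s over nthreads workers, or -1.0 if nothing could be measured. */
static double measure_mb_per_s(int nthreads, size_t words, size_t passes) {
    if (nthreads < 1 || words == 0 || !have_rdrand()) return -1.0;
    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    job_t *jobs = calloc((size_t)nthreads, sizeof *jobs);
    if (!tid || !jobs) { free(tid); free(jobs); return -1.0; }
    for (int i = 0; i < nthreads; i++) {
        jobs[i].buf = malloc(words * sizeof(uint64_t));
        jobs[i].words = jobs[i].buf ? words : 0;  /* no buffer, no work */
        jobs[i].passes = passes;
    }

    struct timespec t0, t1;
    int started = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, worker, &jobs[started]) == 0)
        started++;
    for (int i = 0; i < started; i++) pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    size_t total_words = 0;
    for (int i = 0; i < nthreads; i++) {
        total_words += jobs[i].stored;
        free(jobs[i].buf);
    }
    free(tid);
    free(jobs);

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (total_words == 0 || secs <= 0.0) return -1.0;
    return (double)total_words * sizeof(uint64_t) / secs / 1e6;
}
```

Calling `measure_mb_per_s(n, 1024, 100000)` for n = 1, 2, ..., 8 would give one point per thread count. Thread creation is inside the timed region, so `passes` should be large enough for that overhead to be negligible.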

Since the round-trip communication latency from each core to the shared DRNG and back is longer than the time the DRNG needs to generate a random number, the average performance obviously increases as you add threads, up until the maximum throughput is reached. The physical maximum throughput of the DRNG on IVB is 800 MB/s. A 4-core IVB with 8 threads manages something on the order of 780 MB/s. With fewer threads and cores, lower numbers are achieved. The 500 MB/s number is somewhat conservative, but when you're trying to make honest performance claims, you have to be.
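The scaling argument above can be put into a toy model: each thread is limited by its own round trip, and the sum is capped by the DRNG ceiling. The sketch below is an idealization that ignores contention between cores. The 800 MB/s ceiling comes from the answer; the 64-bit operand size and the 3.0 GHz core clock in the example are assumptions chosen only to make the arithmetic concrete, and all function names are hypothetical.

```c
/* Per-thread rate in MB/s when each RDRAND returns `bytes` bytes and
   takes `cycles` core cycles at a core clock of `core_ghz`. */
static double per_thread_mb_per_s(double bytes, double cycles, double core_ghz) {
    double ns = cycles / core_ghz;  /* wall-clock time per instruction */
    return bytes / ns * 1000.0;     /* bytes per ns, scaled to MB/s */
}

/* Idealized aggregate: threads add up until the DRNG ceiling is hit. */
static double aggregate_mb_per_s(int threads, double per_thread, double drng_max) {
    double demand = threads * per_thread;
    return demand < drng_max ? demand : drng_max;
}

/* Smallest thread count whose combined demand reaches the ceiling,
   or -1 if the per-thread rate is not positive. */
static int threads_to_saturate(double per_thread, double drng_max) {
    if (per_thread <= 0.0) return -1;
    int n = 1;
    while (n * per_thread < drng_max) n++;
    return n;
}
```

Under those assumptions, 150 cycles at 3.0 GHz is 50 ns per 8 bytes, or about 160 MB/s per thread, so the model reaches the 800 MB/s ceiling at five threads. Real hardware will deviate from this, since the round-trip time itself grows when several cores access the DRNG at once.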

Since the DRNG runs at a fixed frequency (800 MHz) while the core frequencies may vary, the number of core clock cycles per RdRand varies, depending on the core frequency and on the number of other cores simultaneously accessing the DRNG. The curves given in the IDF presentation are a realistic representation of what to expect. The total performance is affected a little by core clock frequency, but not much. The number of threads is what dominates.
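One way to see why the cycle count moves with the core clock is to treat the round trip as a roughly fixed wall-clock time and convert it into cycles of each clock domain. The sketch below does only that conversion; the fixed-time simplification, the example frequencies and the function names are assumptions, not figures from the answer.

```c
/* Wall-clock time in nanoseconds for `cycles` cycles of a clock at `ghz`. */
static double cycles_to_ns(double cycles, double ghz) {
    return cycles / ghz;
}

/* Number of cycles of a clock at `ghz` that elapse in `ns` nanoseconds. */
static double ns_to_cycles(double ns, double ghz) {
    return ns * ghz;
}
```

For example, 150 core cycles at an assumed 3.0 GHz is 50 ns. The same 50 ns is only 75 cycles of a core running at 1.5 GHz, and 40 cycles of the 800 MHz DRNG clock (`ns_to_cycles(50.0, 0.8)`), so the bytes delivered per second change far less than the cycles-per-instruction figure does.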

When measuring RdRand performance, one should be careful to actually "use" the RdRand result. If you don't, i.e. if you execute RdRand R6, RdRand R6, ..., RdRand R6 repeated many times, the performance will read as artificially high. Since the data isn't used before it is overwritten, the CPU pipeline doesn't wait for the data to come back from the DRNG before it issues the next instruction. The tests we wrote store the resulting data to memory that will be in on-chip cache, so the pipeline stalls waiting for the data. That is also why hyperthreading is so much more effective with RdRand than with other sorts of code.
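The two measurement patterns contrasted above can be sketched side by side. This assumes x86-64 with GCC or Clang inline assembly; `have_rdrand`, `rdrand_discard` and `rdrand_store` are hypothetical names, and the code only shows the shape of each loop without claiming any particular timing result.

```c
#include <cpuid.h>
#include <stddef.h>
#include <stdint.h>

/* 1 if CPUID leaf 1 reports RDRAND support (ECX bit 30), else 0. */
static int have_rdrand(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((ecx >> 30) & 1u);
}

/* The pattern the answer warns about: each result lands in a register
   and is overwritten without ever being read. Returns iterations run. */
static size_t rdrand_discard(size_t n) {
    if (!have_rdrand()) return 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v;
        __asm__ volatile("rdrand %0" : "=r"(v) : : "cc");
        (void)v;  /* deliberately unused */
    }
    return n;
}

/* The pattern the answer recommends: every delivered value is stored to
   a buffer, so the store depends on the data coming back from the DRNG.
   Returns the number of words stored. */
static size_t rdrand_store(uint64_t *buf, size_t n) {
    if (!have_rdrand()) return 0;
    size_t stored = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v;
        unsigned char ok;
        __asm__ volatile("rdrand %0; setc %1" : "=r"(v), "=qm"(ok) : : "cc");
        if (!ok) continue;   /* carry clear: no value was delivered */
        buf[stored++] = v;
    }
    return stored;
}
```

Timing `rdrand_store` over a buffer small enough to stay in cache is the variant that corresponds to the measurements described in the answer.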

The details of the specific platform, clock speed, Linux version and GCC version were given in the IDF slides. I don't remember the numbers off the top of my head. There are chips available that are slower and chips available that are faster. The figure we gave of <200 cycles per instruction is based on measurements of about 150 core cycles per instruction.

The chips are available now, so anyone well versed in the use of rdtsc can run the same sort of test.
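A rough version of such a test could look like the sketch below. It assumes x86-64 with GCC or Clang; `__rdtsc()` is the compiler intrinsic from `<x86intrin.h>`, while `have_rdrand` and `tsc_ticks_per_rdrand` are hypothetical names. The timestamp counter on recent Intel parts ticks at a constant rate, so the result is in TSC ticks, which only equal core cycles when the core happens to run at the TSC frequency. Loop and store overhead are included in the figure.

```c
#include <cpuid.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

/* 1 if CPUID leaf 1 reports RDRAND support (ECX bit 30), else 0. */
static int have_rdrand(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((ecx >> 30) & 1u);
}

/* Average TSC ticks per stored RDRAND result over n attempts, or -1.0
   if RDRAND is unavailable or never delivered a value. */
static double tsc_ticks_per_rdrand(uint64_t *buf, size_t n) {
    if (n == 0 || !have_rdrand()) return -1.0;
    size_t stored = 0;
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < n; i++) {
        uint64_t v;
        unsigned char ok;
        __asm__ volatile("rdrand %0; setc %1" : "=r"(v), "=qm"(ok) : : "cc");
        if (ok) buf[stored++] = v;  /* store, so the result is used */
    }
    uint64_t end = __rdtsc();
    if (stored == 0) return -1.0;
    return (double)(end - start) / (double)stored;
}
```

Running it several times on a pinned thread and taking the minimum or median is a common way to reduce noise from interrupts and frequency changes.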
