What is the latency and throughput of the RDRAND instruction on Ivy Bridge?

Question

I cannot find any info on agner.org on the latency or throughput of the RDRAND instruction. However, this processor exists, so the information must be out there.

Actually, the newest optimization manual mentions this instruction. It is documented as taking <200 cycles, with a total bandwidth of at least 500 MB/s on Ivy Bridge. But some more in-depth statistics on this instruction would be great, since the latency and throughput are variable.

Recommended answer

I wrote librdrand. It's a very basic set of routines to use the RdRand instruction to fill buffers with random numbers.

The performance data we showed at IDF is from test software I wrote that spawns a number of threads using pthreads in Linux. Each thread fills a memory buffer with random numbers using RdRand. The program measures the average speed and can iterate while varying the number of threads.

Since there is a round-trip communications latency from each core to the shared DRNG and back that is longer than the time needed to generate a random number at the DRNG, the average performance obviously increases as you add threads, up until the maximum throughput is reached. The physical maximum throughput of the DRNG on IVB is 800 MB/s. A 4-core IVB with 8 threads manages something of the order of 780 MB/s. With fewer threads and cores, lower numbers are achieved. The 500 MB/s number is somewhat conservative, but when you're trying to make honest performance claims, you have to be.
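
The scaling described here can be sketched as a simple saturating model: each thread is limited by the round-trip latency, and the total is capped by the DRNG's physical maximum. The Python below is only an illustration under stated assumptions. The function name and the per-thread rate of 160 MB/s are hypothetical, chosen for the example rather than measured, and real curves flatten more gradually near the cap (the answer quotes about 780 MB/s, not the full 800 MB/s, for 8 threads).

```python
# Idealised model: total throughput grows linearly with the thread count
# until it hits the DRNG's physical maximum.

DRNG_MAX_MB_S = 800.0  # physical DRNG maximum on IVB, as stated in the answer


def aggregate_throughput(threads, per_thread_mb_s, cap_mb_s=DRNG_MAX_MB_S):
    """Return the modelled total RdRand throughput in MB/s.

    per_thread_mb_s is an ASSUMED latency-bound rate for one thread;
    it is a hypothetical input, not a figure from the answer.
    """
    if threads < 0 or per_thread_mb_s < 0:
        raise ValueError("threads and per_thread_mb_s must be non-negative")
    return min(threads * per_thread_mb_s, cap_mb_s)


if __name__ == "__main__":
    # 160 MB/s per thread is a hypothetical value for illustration only.
    for n in (1, 2, 4, 8, 16):
        print(n, "threads:", aggregate_throughput(n, 160.0), "MB/s")
```

In this model, adding threads past the saturation point gains nothing, which is the behaviour the answer describes once the maximum throughput is reached.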

Since the DRNG runs at a fixed frequency (800MHz) while the core frequencies may vary, the number of core clock cycles per RdRand varies, depending on the core frequency and the number of other cores simultaneously accessing the DRNG. The curves given in the IDF presentation are a realistic representation of what to expect. The total performance is affected a little by core clock frequency, but not much. The number of threads is what dominates.
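
To illustrate why the cycle count moves with core frequency, here is a rough Python sketch. It assumes, as a simplification, that the whole round trip takes a fixed wall-clock time regardless of core clock. The 50 ns latency and the three core frequencies are hypothetical values, and the function name is made up for this example.

```python
# If the round trip to the DRNG takes a fixed wall-clock time, a faster core
# simply counts more of its own cycles while it waits.


def core_cycles_per_rdrand(latency_ns, core_ghz):
    """Core clock cycles elapsed during one RdRand round trip.

    latency_ns: ASSUMED fixed wall-clock round-trip time (hypothetical).
    core_ghz:   core clock frequency in GHz (1 GHz = 1 cycle per ns).
    """
    if latency_ns < 0 or core_ghz <= 0:
        raise ValueError("latency_ns must be >= 0 and core_ghz must be > 0")
    return latency_ns * core_ghz


if __name__ == "__main__":
    # 50 ns and these frequencies are illustrative, not measured.
    for ghz in (1.6, 3.0, 3.8):
        cycles = core_cycles_per_rdrand(50.0, ghz)
        print(f"{ghz} GHz: about {cycles:.0f} core cycles")
```

Under this assumption, the same wall-clock latency costs more core cycles on a faster core, while the bytes delivered per second barely change. That matches the remark that total performance depends only a little on core clock frequency.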

One should be careful when measuring RdRand performance to actually 'use' the RdRand result. If you don't, i.e. you issue RdRand R6, RdRand R6, ..., RdRand R6 repeated many times, the measured performance will read as artificially high. Since the data isn't used before it is overwritten, the CPU pipeline doesn't wait for the data to come back from the DRNG before it issues the next instruction. The tests we wrote write the resulting data to memory that will be in on-chip cache, so the pipeline stalls waiting for the data. That is also why hyperthreading is so much more effective with RdRand than with other sorts of code.

The details of the specific platform, clock speed, Linux version and GCC version were given in the IDF slides. I don't remember the numbers off the top of my head. There are chips available that are slower and chips available that are faster. The number we gave for <200 cycles per instruction is based on measurements of about 150 core cycles per instruction.
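
As a rough worked example, the figure of about 150 core cycles can be converted into a time per call and a single-thread data rate. The answer does not recall the platform's clock speed, so the 3.0 GHz core frequency below is a hypothetical value, as is the choice of a 64-bit RdRand returning 8 bytes per call. The function names are made up for this sketch.

```python
# Convert a cycles-per-instruction figure into wall-clock time and data rate.


def ns_per_call(core_hz, cycles_per_call):
    """Wall-clock nanoseconds spent on one RdRand at the given core clock."""
    return cycles_per_call / core_hz * 1e9


def single_thread_mb_per_s(core_hz, cycles_per_call, bytes_per_call=8):
    """Single-thread RdRand data rate in MB/s (1 MB = 10**6 bytes).

    bytes_per_call=8 ASSUMES the 64-bit form of the instruction.
    """
    return bytes_per_call * core_hz / cycles_per_call / 1e6


if __name__ == "__main__":
    core_hz = 3.0e9   # hypothetical core clock; not given in the answer
    cycles = 150      # approximate measured figure quoted in the answer
    print(f"{ns_per_call(core_hz, cycles):.1f} ns per RdRand")
    print(f"{single_thread_mb_per_s(core_hz, cycles):.0f} MB/s on one thread")
```

With these assumed inputs, one thread stays well below the DRNG's 800 MB/s maximum, which is consistent with the point that several threads are needed to saturate it.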

The chips are available now, so anyone well versed in the use of rdtsc can do the same sort of test.
