memcpy performance differences between 32 and 64 bit processes


Question

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, memcpy performance is on the order of 1.2 GByte/s; however, memcpy in a 64-bit process achieves about 2.2 GByte/s (or 2.4 GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of the 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.

My question is, what is this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?

Thanks for any insight.

Also asked on the Intel forums.

Answer

Of course, you really need to look at the actual machine instructions being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.

My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; rather, the faster library routine was written using SSE non-temporal stores.

If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory makes a one-way trip to memory instead of a round trip.

I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...

Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.

