Why are complicated memcpy/memset superior?

Question

When debugging, I frequently step into the handwritten assembly implementations of memcpy and memset. These are usually implemented using streaming instructions (if available), loop unrolling, alignment optimization, and so on. I also recently encountered this 'bug' caused by a memcpy optimization in glibc.
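To make the alignment-and-bulk-copy structure concrete, here is a minimal sketch of the shape such implementations take: a byte prologue to align the destination, a word-at-a-time bulk loop, and a byte tail. This is purely illustrative (real libc versions add SIMD/streaming stores and runtime CPU dispatch); `naive_memcpy` is a hypothetical name, not a glibc function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only -- real libc memcpy uses SIMD/streaming
 * stores and runtime CPU-feature dispatch on top of this structure. */
static void *naive_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte prologue: copy until the destination is word-aligned. */
    while (n && ((uintptr_t)d % sizeof(uintptr_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Bulk loop: copy one machine word per iteration. Going through a
     * local word with memcpy keeps this well-defined even when the
     * source is not word-aligned. */
    while (n >= sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Byte epilogue for the remainder. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```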

The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific case of

rep stos

rep movs

so that these are recognized as such, and do the fastest fill and copy possible on their own architecture?

Answer

Cost.

The cost of optimizing memcpy in your C library is fairly minimal, maybe a few weeks of developer time here and there. You'll have to make a new version every several years or so when processor features change enough to warrant a rewrite. For example, GNU's glibc and Apple's libSystem both have a memcpy which is specifically optimized for SSE3.

The cost of optimizing in hardware is much higher. Not only is it more expensive in terms of developer costs (designing a CPU is vastly more difficult than writing user-space assembly code), but it would increase the transistor count of the processor. That could have a number of negative effects:

  • Increased power consumption
  • Increased unit cost
  • Increased latency in some CPU subsystems
  • Lower maximum clock speed

In theory, it could have an overall negative impact on both performance and unit cost.

Maxim: Don't do it in hardware if the software solution is good enough.

Note: The bug you've cited is not really a bug in glibc w.r.t. the C specification. It's more complicated. Basically, the glibc folks say that memcpy behaves exactly as advertised in the standard, and some other folks are complaining that memcpy should be aliased to memmove.
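The overlap issue at the heart of that complaint can be shown in a few lines: when source and destination ranges overlap, the C standard makes memcpy undefined (glibc's backwards-copying optimization really did corrupt such callers), while memmove is specified to handle it. A small sketch:

```c
#include <string.h>

/* Shift the first len bytes of buf one position to the right.
 * The ranges [buf, buf+len) and [buf+1, buf+len+1) overlap, so
 * memcpy here would be undefined behavior (C99 7.21.2.1) -- and
 * glibc's optimized memcpy could visibly break it. memmove is
 * required to get overlap right. */
static void shift_right(char *buf, size_t len)
{
    memmove(buf + 1, buf, len);
}
```

Programs that "worked" with the old forward-copying memcpy were relying on behavior the standard never promised, which is the glibc maintainers' point.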

Time for a story: It reminds me of a complaint that a Mac game developer had when he ran his game on a 603 processor instead of a 601 (this is from the 1990s). The 601 had hardware support for unaligned loads and stores with minimal performance penalty. The 603 simply generated an exception; by offloading to the kernel I imagine the load/store unit could be made much simpler, possibly making the processor faster and cheaper in the process. The Mac OS nanokernel handled the exception by performing the required load/store operation and returning control to the process.

But this developer had a custom blitting routine to write pixels to the screen which did unaligned loads and stores. Game performance was fine on the 601 but abominable on the 603. Most other developers didn't notice if they used Apple's blitting function, since Apple could just reimplement it for newer processors.

The moral of the story is that better performance comes both from software and hardware improvements.

In general, the trend seems to be in the opposite direction from the kind of hardware optimization mentioned. While on x86 it's easy to write memcpy in assembly, some newer architectures offload even more work to software. Of particular note are the VLIW architectures: Intel IA-64 (Itanium), the TI TMS320C64x DSPs, and the Transmeta Efficeon are examples. With VLIW, assembly programming gets much more complicated: you have to explicitly select which execution units get which instructions and which instructions can issue at the same time, something a modern x86 does for you (unless it's an Atom). So writing memcpy suddenly gets much, much harder.

These architectural tricks allow you to cut a huge chunk of hardware out of your microprocessor while retaining the performance benefits of a superscalar design. Imagine having a chip with a footprint closer to an Atom but performance closer to a Xeon. I suspect the difficulty of programming these devices is the major factor impeding wider adoption.
