What makes Apple's PowerPC memcpy so fast?

Views: 109 | Posted: 2020/5/8 18:48:18 | Tags: optimization memcpy powerpc shark altivec

Question

I've written several copy functions in search of a good memory strategy on PowerPC. Using the Altivec or fp registers with cache hints (dcb*) doubles the performance over a simple byte copy loop for large data. Initially pleased with that, I threw in a regular memcpy to see how it compared... 10x faster than my best! I have no intention of rewriting memcpy, but I do hope to learn from it and accelerate several simple image filters that spend most of their time moving pixels to and from memory.

Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration, the performance advantage of memcpy is still embarrassing. I'm using dcbz to free up bandwidth, while Apple uses nothing, but both versions tend to hesitate on stores.


prefetch
  dcbt future
  dcbt distant future
load stuff
  lvx image
  lvx image + 16
  lvx image + 32
  lvx image + 48
  image += 64
prepare to store
  dcbz filtered
  dcbz filtered + 32
store stuff
  stvxl filtered
  stvxl filtered + 16
  stvxl filtered + 32
  stvxl filtered + 48
  filtered += 64
repeat
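
For readers who want to experiment with the loop shape above off PowerPC, here is a minimal portable C sketch. It is not the asker's code and not Apple's memcpy. The names copy64_sketch and vec16 are hypothetical, the 16-byte struct copies stand in for lvx/stvxl, and __builtin_prefetch (a GCC/Clang builtin) stands in for dcbt. The 256-byte prefetch distance is an arbitrary assumption, and dcbz has no portable equivalent, so it is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte stand-in for one AltiVec vector register. */
typedef struct { unsigned char b[16]; } vec16;

/* Copies n bytes from src to dst, 64 bytes (4 "vectors") per iteration.
   Assumes n is a multiple of 64 and both pointers are 16-byte aligned,
   matching the "all data is vector aligned" condition in the question. */
void copy64_sketch(void *dst, const void *src, size_t n)
{
    assert(n % 64 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);

    const vec16 *s = (const vec16 *)src;
    vec16 *d = (vec16 *)dst;
    size_t vectors = n / 16;

    for (size_t i = 0; i < vectors; i += 4) {
        /* prefetch: hint at data 256 bytes ahead (the role dcbt plays);
           the distance is an assumption, and the guard keeps the
           address inside the source buffer */
        if (i + 16 < vectors)
            __builtin_prefetch(&s[i + 16], 0, 0);

        /* load stuff: four reads first (lvx image ... image + 48) */
        vec16 v0 = s[i];
        vec16 v1 = s[i + 1];
        vec16 v2 = s[i + 2];
        vec16 v3 = s[i + 3];

        /* store stuff: then four writes (stvxl filtered ... + 48) */
        d[i]     = v0;
        d[i + 1] = v1;
        d[i + 2] = v2;
        d[i + 3] = v3;
    }
}
```

A real filter would transform v0 to v3 between the loads and the stores. Timing this against memcpy on the target machine is the only way to learn anything from it, since the compiler is free to reorder or vectorize the body.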

Does anyone have some ideas on why very similar code has such a dramatic performance gap? I'd love to marinate the real image filters in whatever secret sauce memcpy is using!

Additional info: All data is vector aligned. I'm making filtered copies of the image, not replacing the original. The code runs on PowerPC G4, G5, and Cell PPU. The Cell SPU version is already insanely fast.

Recommended answer

"Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration"

I may be stating the obvious, but since you don't mention the following at all in your question, it may be worth pointing out:

I would bet that Apple's choice of 4 vector reads followed by 4 vector writes has as much to do with the G5's pipeline and its management of out-of-order instruction execution in "dispatch groups" as it has with a magical 64-byte perfect line size. Did you notice the line skips in Nick Bastin's linked bcopy.s? These mean that the developer thought about how the instruction stream would be consumed by the G5. If you want to reproduce the same performance, it's not enough to read data 64 bytes at a time; you must also make sure your instruction groups are well filled. Basically, as I remember it, up to five independent instructions can be grouped together, with the first four being non-jump instructions and the fifth only being allowed to be a jump. The details are more complicated.
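
To make the "well filled groups" idea concrete, here is a toy C model of the simplified rule recalled above: four non-jump slots plus a fifth slot reserved for a jump. It is only a sketch. The function name and the one-letter instruction codes are made up for illustration, the model ignores instruction dependencies entirely, and the real G5 grouping rules are more complicated, as noted above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: packs an instruction stream into "dispatch groups"
   using the simplified rule from the answer. Each character is one
   instruction; 'B' is a jump and any other letter is a non-jump
   instruction. Dependencies between instructions are NOT modeled. */
size_t count_dispatch_groups(const char *stream)
{
    size_t groups = 0;
    int used = 0;   /* non-jump slots filled in the currently open group */

    for (const char *p = stream; *p != '\0'; ++p) {
        if (*p == 'B') {
            /* a jump takes the fifth slot and closes the group */
            groups++;
            used = 0;
        } else {
            if (used == 4) {   /* all four non-jump slots are taken */
                groups++;
                used = 0;
            }
            used++;
        }
    }
    if (used > 0)
        groups++;   /* count a trailing, partly filled group */
    return groups;
}
```

Under this toy rule, four loads, four stores, and a jump ("LLLLSSSSB") pack into two groups. The loop from the question, written as "PPLLLLAZZSSSSAB" (P = dcbt, L = lvx, A = pointer add, Z = dcbz, S = stvxl, B = jump), needs four groups, the last of which is only partly filled.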

You may also be interested in the following paragraph on the same page:

"The dcbz instruction still zeros aligned 32 byte segments of memory as per the G4 and G3. However, since that is not a full cacheline on a G5 it will not have the performance benefits that you were likely hoping for. There is a dcbzl instruction newly introduced for the G5 that zeros a full 128-byte cacheline."
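
The granularity point in that paragraph can be illustrated with a small C emulation. This is a sketch under stated assumptions, not real cache-control code: emulate_block_zero and count_zero_bytes are hypothetical helpers, the buffer stands in for one 128-byte G5 cache line, and memset merely imitates the "zero the aligned block containing this address" behaviour described for dcbz (32 bytes) and dcbzl (128 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Emulates a block-zero instruction: clears the aligned block of
   block_bytes that contains buf[offset]. Assumes buf itself starts on
   a block boundary and block_bytes is a power of two. */
void emulate_block_zero(unsigned char *buf, size_t offset, size_t block_bytes)
{
    assert(block_bytes != 0 && (block_bytes & (block_bytes - 1)) == 0);
    size_t start = offset & ~(block_bytes - 1);   /* round down to block */
    memset(buf + start, 0, block_bytes);
}

/* Counts how many of the first n bytes of buf are zero. */
size_t count_zero_bytes(const unsigned char *buf, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0)
            zeros++;
    }
    return zeros;
}
```

With a 128-byte buffer filled with 0xFF, the two 32-byte zeroing steps from the question's loop (offsets 0 and 32) clear only 64 bytes, so half of the line is untouched. A single 128-byte step at any offset inside the line clears all of it, which is the behaviour the quoted paragraph attributes to dcbzl.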
