Fast ARM NEON memcpy
Question
I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line.
for (i = 0; i < h; i++) {
    memcpy(d, s, w);
    s += sp;
    d += dp;
}
I know that d, dp, s, sp, and w are all 32-byte aligned, so my next (still quite naive) implementation was along the lines of
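for (int i = 0; i < h; i++) {
    uint8_t* dst = d;
    const uint8_t* src = s;
    int remaining = w;  // bytes left in this row; must be a positive multiple of 32
    asm volatile (
        "1:                                               \n"
        "subs     %[rem], %[rem], #32                     \n"  // count down and set flags
        "vld1.u8  {d0, d1, d2, d3}, [%[src],:256]!        \n"  // load 32 bytes, post-increment src
        "vst1.u8  {d0, d1, d2, d3}, [%[dst],:256]!        \n"  // store 32 bytes, post-increment dst
        "bgt      1b                                      \n"  // loop while bytes remain
        : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
        :
        : "d0", "d1", "d2", "d3", "cc", "memory"
    );
    // advance to the next row using the destination and source strides
    d += dp;
    s += sp;
}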
This was ~150% faster than memcpy over a large number of iterations (on different images, so not taking advantage of caching). I feel this should be nowhere near the optimum because I have yet to use preloading, but whenever I add it I only seem to make performance substantially worse. Does anyone have any insight here?
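For reference, a comparison like the one described above could be measured with a harness along these lines. This is only a sketch under stated assumptions: the names copy_fn, copy_rows_memcpy and time_copy are hypothetical, the images are assumed to be tightly packed (stride equal to width), and timing uses clock() from the C standard library, which reports CPU time rather than wall-clock time. To keep the cache from helping, w * h * n_images should be chosen well above the size of the last-level cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every copy routine under test. */
typedef void (*copy_fn)(uint8_t *d, size_t dp,
                        const uint8_t *s, size_t sp,
                        size_t w, size_t h);

/* Baseline from the question: one memcpy per row. */
static void copy_rows_memcpy(uint8_t *d, size_t dp,
                             const uint8_t *s, size_t sp,
                             size_t w, size_t h)
{
    for (size_t i = 0; i < h; i++) {
        memcpy(d, s, w);
        s += sp;
        d += dp;
    }
}

/* Copies n_images distinct, tightly packed images once each and returns
 * the CPU seconds spent, or -1.0 on allocation failure or a bad copy. */
static double time_copy(copy_fn fn, size_t w, size_t h, size_t n_images)
{
    size_t img = w * h;
    size_t total = img * n_images;
    uint8_t *src = malloc(total);
    uint8_t *dst = malloc(total);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Fill the source with a pattern and touch the destination so that
     * first-touch page faults are not included in the measurement. */
    for (size_t i = 0; i < total; i++)
        src[i] = (uint8_t)(i * 31u + 7u);
    memset(dst, 0, total);

    clock_t t0 = clock();
    for (size_t k = 0; k < n_images; k++)
        fn(dst + k * img, w, src + k * img, w, w, h);
    clock_t t1 = clock();

    int ok = memcmp(src, dst, total) == 0;  /* verify the routine really copied */
    free(src);
    free(dst);
    return ok ? (double)(t1 - t0) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_copy(copy_rows_memcpy, w, h, n) and then time_copy with a wrapper around the NEON loop, using the same arguments, gives two figures that can be compared directly.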
Recommended answer
ARM has a great tech note on this.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
Your performance will definitely vary depending on the micro-architecture. ARM's note covers the Cortex-A8, but I think it will give you a decent idea. The summary at the bottom is a great discussion of the pros and cons that go beyond the raw numbers, such as which methods use the fewest registers.
And yes, as another commenter mentions, pre-fetching is very difficult to get right and behaves differently across micro-architectures, depending on how big the caches are, how big each line is, and a bunch of other details of the cache design. You can end up thrashing lines you need if you aren't careful. I would recommend avoiding it for portable code.
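If you do want to experiment, it helps to make the prefetch distance a single compile-time knob so it can be swept, or switched off, per target. The following is a minimal sketch rather than a tuned routine: copy_image_prefetch, PREFETCH_AHEAD and CHUNK are hypothetical names, and it assumes GCC or Clang, whose __builtin_prefetch is a hint that typically lowers to PLD on ARMv7. The fixed 32-byte memcpy stands in for the vld1/vst1 pair from the question; what the compiler emits for it is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning knob: how many bytes ahead of the current read
 * position to hint. 0 disables prefetching entirely. The best value
 * depends on the core and cache design, so measure per target. */
#ifndef PREFETCH_AHEAD
#define PREFETCH_AHEAD 256
#endif

/* Bytes copied per step, matching the 32-byte alignment in the question. */
enum { CHUNK = 32 };

static void copy_image_prefetch(uint8_t *d, size_t dp,
                                const uint8_t *s, size_t sp,
                                size_t w, size_t h)
{
    assert(w % CHUNK == 0);  /* same precondition as the NEON loop */

    for (size_t y = 0; y < h; y++) {
        const uint8_t *src = s + y * sp;
        uint8_t *dst = d + y * dp;

        for (size_t x = 0; x < w; x += CHUNK) {
#if PREFETCH_AHEAD > 0
            /* Hint a read (rw = 0) with no temporal locality (0), and
             * only while the target address is still inside this row. */
            if (x + PREFETCH_AHEAD < w)
                __builtin_prefetch(src + x + PREFETCH_AHEAD, 0, 0);
#endif
            memcpy(dst + x, src + x, CHUNK);
        }
    }
}
```

Building once with -DPREFETCH_AHEAD=0 and again with a few different distances, then timing each build on the actual target, shows whether the hint helps or hurts there. This sketch never hints across row boundaries, so the start of every row is fetched cold.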