Fast ARM NEON memcpy


Question

I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line.

for(i = 0; i < h; i++) {
  memcpy(d, s, w);
  s += sp;
  d += dp;
}

I know that the following

d, dp, s, sp, w

are all 32-byte aligned, so my next (still quite naive) implementation was along the lines of

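for (int i = 0; i < h; i++) {
  uint8_t* dst = d;
  const uint8_t* src = s;
  int remaining = w;  // bytes left in this row; must be a nonzero multiple of 32
  asm volatile (
    "1:                                               \n"
    "subs     %[rem], %[rem], #32                     \n"  // count down 32 bytes and set flags
    "vld1.u8  {d0, d1, d2, d3}, [%[src],:256]!        \n"  // load 32 bytes, 256-bit aligned, post-increment
    "vst1.u8  {d0, d1, d2, d3}, [%[dst],:256]!        \n"  // store 32 bytes, 256-bit aligned, post-increment
    "bgt      1b                                      \n"  // loop while bytes remain
    : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
    :
    : "d0", "d1", "d2", "d3", "cc", "memory"
  );
  d += dp;  // advance to the next destination row
  s += sp;  // advance to the next source row
}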

Which was ~150% faster than memcpy over a large number of iterations (on different images, so not taking advantage of caching). I feel like this should be nowhere near the optimum because I am yet to use preloading, but when I do I only seem to be able to make performance substantially worse. Does anyone have any insight here?
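For comparison, the same 32-bytes-per-iteration loop can be sketched with NEON intrinsics instead of inline assembly. This is only an illustrative sketch, not the asker's code: the names `copy_row32` and `copy_image_neon` are made up here, `w` is assumed to be a multiple of 32, and a plain `memcpy` fallback is used when the compiler does not define `__ARM_NEON`. Unlike the assembly above, the intrinsics do not carry the `:256` alignment hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: copy one row of w bytes, 32 bytes per iteration.
 * Assumes w is a multiple of 32. */
static void copy_row32(uint8_t *dst, const uint8_t *src, size_t w)
{
    assert(w % 32 == 0);
    for (size_t x = 0; x < w; x += 32) {
#if HAVE_NEON
        /* Two 128-bit loads followed by two 128-bit stores. */
        uint8x16_t lo = vld1q_u8(src + x);
        uint8x16_t hi = vld1q_u8(src + x + 16);
        vst1q_u8(dst + x, lo);
        vst1q_u8(dst + x + 16, hi);
#else
        /* Fallback for builds without NEON support. */
        memcpy(dst + x, src + x, 32);
#endif
    }
}

/* Hypothetical wrapper: copy h rows; sp and dp are the source and
 * destination strides in bytes. */
static void copy_image_neon(uint8_t *d, ptrdiff_t dp,
                            const uint8_t *s, ptrdiff_t sp,
                            size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        copy_row32(d, s, w);
        s += sp;
        d += dp;
    }
}
```

Whether the compiler's output matches the hand-written loop would need to be checked on the target core.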

Recommended answer

ARM has a great tech note on this.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

Your performance will definitely vary depending on the micro-architecture, ARM's note is on the A8 but I think it will give you a decent idea, and the summary at the bottom is a great discussion of the various pros and cons that go beyond just the regular numbers, such as which methods result in the least amount of register usage, etc.

And yes, as another commenter mentions, pre-fetching is very difficult to get right, and will work differently with different micro-architectures, depending on how big the caches are and how big each line is and a bunch of other details about the cache design. You can end up thrashing lines you need if you aren't careful. I would recommend avoiding it for portable code.
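If you do want to experiment with it, a minimal sketch of per-block prefetching might look like the following. It assumes GCC or Clang, whose `__builtin_prefetch` is only a hint (on ARMv7 it typically becomes a `pld`). The function name `copy_image_prefetch` and the `PREFETCH_AHEAD` distance are hypothetical; the distance in particular is a placeholder that would have to be tuned, or dropped, per core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning knob: how many bytes ahead of the current read
 * position to request. The right value, if any, depends on the core. */
#define PREFETCH_AHEAD 256

/* Copy h rows of w bytes; sp and dp are the source and destination
 * strides in bytes. Assumes w is a multiple of 32. */
static void copy_image_prefetch(uint8_t *d, ptrdiff_t dp,
                                const uint8_t *s, ptrdiff_t sp,
                                size_t w, size_t h)
{
    assert(w % 32 == 0);
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x += 32) {
            /* Read hint with low temporal locality. The address is built
             * through uintptr_t so that no out-of-bounds pointer
             * arithmetic is performed near the end of the buffer. */
            __builtin_prefetch(
                (const void *)((uintptr_t)(s + x) + PREFETCH_AHEAD), 0, 0);
            memcpy(d + x, s + x, 32);
        }
        s += sp;
        d += dp;
    }
}
```

Timing this against the version without the hint, on each target core, is the only way to tell whether it helps or hurts.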
