ARM/NEON memcpy optimized for *uncached* memory?


Question

I'm using a Xilinx Zynq 7000 ARM-based SoC. I'm struggling with DMA buffers (see "Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)"), so one thing I pursued was a faster memcpy.

I've been looking at writing a faster memcpy for ARM using NEON instructions and inline asm. Whatever glibc has, it's terrible, especially when copying from an uncached DMA buffer.

I've put together my own copy function from various sources, including:

  • Fast ARM NEON memcpy
  • arm Inline assembly in gcc
  • http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

The main difference for me is that I'm trying to copy from an uncached buffer, because it's a DMA buffer and ARM support for cached DMA buffers is nonexistent.

Here's what I wrote:

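/* Copies sz bytes rounded UP to a multiple of 64, so both buffers must be
 * padded to a 64-byte multiple. No prefetch (PLD): the source is uncached. */
void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz <= 0) {
        return;                      /* the loop below always runs at least once */
    }
    if (sz & 63) {
        sz = (sz & -64) + 64;        /* round up to the next multiple of 64 */
    }
    asm volatile (
        "1:                              \n"   /* numeric local label: safe if inlined more than once */
        "    VLDM %[src]!,{d0-d7}        \n"   /* load 64 bytes, post-increment src */
        "    VSTM %[dst]!,{d0-d7}        \n"   /* store 64 bytes, post-increment dst */
        "    SUBS %[sz],%[sz],#0x40      \n"   /* sz -= 64 and set the flags */
        "    BGT 1b                      \n"   /* loop while sz > 0 */
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz)
        :
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}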
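Because the function above rounds `sz` up to a multiple of 64, it can read and write past the end of a buffer that is not padded. The following is a minimal sketch of one way to avoid that, not the asker's code: it copies whole 64-byte blocks and then finishes the tail byte by byte. The names `copy_block64` and `my_copy_exact` are hypothetical, and the NEON intrinsics path is only compiled when the compiler defines `__ARM_NEON`; elsewhere it falls back to `memcpy` for each block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy exactly one 64-byte block. */
static void copy_block64(uint8_t *dst, const uint8_t *src)
{
#if defined(__ARM_NEON)
    /* Four 16-byte NEON loads followed by four 16-byte stores. */
    uint8x16_t a = vld1q_u8(src);
    uint8x16_t b = vld1q_u8(src + 16);
    uint8x16_t c = vld1q_u8(src + 32);
    uint8x16_t d = vld1q_u8(src + 48);
    vst1q_u8(dst, a);
    vst1q_u8(dst + 16, b);
    vst1q_u8(dst + 32, c);
    vst1q_u8(dst + 48, d);
#else
    memcpy(dst, src, 64);   /* portable fallback for non-NEON targets */
#endif
}

/* Hypothetical name: copies exactly sz bytes, never more. */
void my_copy_exact(void *dst, const void *src, size_t sz)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (sz >= 64) {          /* bulk: whole 64-byte blocks */
        copy_block64(d, s);
        d += 64;
        s += 64;
        sz -= 64;
    }
    while (sz > 0) {            /* tail: remaining 0..63 bytes */
        *d++ = *s++;
        sz--;
    }
}
```

Whether the intrinsics version matches the hand-written `VLDM`/`VSTM` loop on uncached memory depends on the code the compiler generates, so it would need to be measured on the target.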

The main thing I did was leave out the prefetch instruction, because I figured it would be worthless on uncached memory.

This gave a 4.7x speedup over the glibc memcpy: throughput went from about 70 MB/sec to about 330 MB/sec.

Unfortunately, this isn't nearly as fast as memcpy from cached memory, which runs at around 720 MB/sec for the system memcpy and 620 MB/sec for the NEON version (probably slower because my memcpy doesn't prefetch).
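For reference, throughput figures like these can be obtained with a small timing harness. The sketch below is an assumption about how such a measurement might be done, not the asker's actual benchmark: `copy_fn`, `memcpy_wrapper`, and `copy_mb_per_sec` are hypothetical names, it relies on POSIX `clock_gettime` with `CLOCK_MONOTONIC`, and it reports megabytes as 10^6 bytes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every copy routine under test. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Adapter so the system memcpy fits the copy_fn signature. */
void memcpy_wrapper(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Runs fn(dst, src, n) reps times and returns throughput in MB/sec. */
double copy_mb_per_sec(copy_fn fn, void *dst, const void *src,
                       size_t n, int reps)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        fn(dst, src, n);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0) {
        return 0.0;             /* interval too short to measure */
    }
    return ((double)n * (double)reps) / 1e6 / secs;
}
```

To compare cached and uncached sources, the same function would be called once with `src` pointing at an ordinary heap buffer and once with `src` pointing at the mapped DMA buffer, using a buffer large enough and enough repetitions that the timer resolution is not a factor.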

Can anyone help me figure out what I can do to make up for this performance gap?

I've tried a number of things, like copying more at once and issuing two loads followed by two stores. I could try prefetch just to prove that it's useless. Any other ideas?
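The prefetch experiment mentioned here could be sketched as follows. This is an illustration under assumptions, not a tested optimization: `copy_with_prefetch` and `PREFETCH_AHEAD` are hypothetical names, the 256-byte distance is an arbitrary starting value, and `__builtin_prefetch` is the GCC/Clang builtin, which on ARM is normally lowered to a `PLD` instruction. Whether a prefetch hint has any effect on an uncached mapping is exactly the open question, so only a measurement on the target can settle it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning knob: how far ahead of the read pointer to prefetch. */
#define PREFETCH_AHEAD 256

/* Hypothetical name: block copy that issues a read-prefetch hint per block. */
void copy_with_prefetch(void *dst, const void *src, size_t sz)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (sz >= 64) {
        if (sz > PREFETCH_AHEAD) {
            /* Hint only: read access (0), low temporal locality (0).
             * Guarded so the address stays inside the source buffer. */
            __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
        }
        memcpy(d, s, 64);       /* fixed-size copy the compiler can inline */
        d += 64;
        s += 64;
        sz -= 64;
    }
    if (sz > 0) {
        memcpy(d, s, sz);       /* tail: remaining 0..63 bytes */
    }
}
```

Timing this against the same loop with the prefetch line removed, on both the cached and the uncached buffer, would show whether the hint matters in either case.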

Recommended answer

You can try using buffered memory rather than non-cached memory.
