ARM/NEON memcpy optimized for *uncached* memory?
Question
I'm using a Xilinx Zynq 7000 ARM-based SoC. I'm struggling with DMA buffers (see "Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)"), so one thing I pursued was a faster memcpy.
I've been looking at writing a faster memcpy for ARM using NEON instructions and inline asm. Whatever glibc has, it's terrible, especially when copying from an uncached DMA buffer.
I've put together my own copy function from various sources, including:
- Fast ARM NEON memcpy
- arm Inline assembly in gcc
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
The main difference for me is that I'm trying to copy from an uncached buffer because it's a DMA buffer, and ARM support for cached DMA buffers is nonexistent.
This is what I wrote:
void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    /* Round sz up to a multiple of 64: the loop moves 64 bytes per pass. */
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "1:                             \n"  /* local label, safe if inlined */
        "    VLDM %[src]!,{d0-d7}       \n"  /* load 64 bytes, advance src */
        "    VSTM %[dst]!,{d0-d7}       \n"  /* store 64 bytes, advance dst */
        "    SUBS %[sz],%[sz],#0x40     \n"
        "    BGT 1b                     \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz)
        :
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}
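One detail of the function above that is easy to miss: because the size is rounded up to the next multiple of 64, the copy can read and write past the requested length, so both buffers presumably need to be padded accordingly. A minimal sketch of that rounding in portable C, using a hypothetical helper name `round_up_64` that does not appear in the original code:

```c
/* Hypothetical helper (not part of the original post): mirrors the size
 * adjustment at the top of my_copy for positive sizes. */
static int round_up_64(int sz)
{
    if (sz & 63) {
        /* Clear the low 6 bits, then step up to the next 64-byte boundary. */
        sz = (sz & -64) + 64;
    }
    return sz;
}
```

For example, a request for 100 bytes becomes a 128-byte copy, while a request that is already a multiple of 64 is left unchanged.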
The main thing I did was leave out the prefetch instruction because I figured it would be worthless on uncached memory.
Doing this resulted in a speedup of 4.7x over the glibc memcpy. Speed went from about 70MB/sec to about 330MB/sec.
Unfortunately, this isn't nearly as fast as memcpy from cached memory, which runs at around 720MB/sec for the system memcpy and 620MB/sec for the NEON version (probably slower because my memcpy doesn't do prefetching).
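The post does not say how these MB/sec figures were measured. As a rough sketch only, throughput numbers of this kind could be obtained with a timing loop like the one below. The helper name `copy_mb_per_sec`, the use of `CLOCK_MONOTONIC`, and the choice of 1 MB = 10^6 bytes are all assumptions made for illustration, not details from the original.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: times `iters` copies of `len` bytes with memcpy and
 * returns the throughput in MB/sec (1 MB taken as 1e6 bytes). */
static double copy_mb_per_sec(void *dst, const void *src, size_t len, int iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier (GCC/Clang syntax) so the copies are not elided. */
        __asm__ volatile ("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0) {
        return 0.0;  /* interval too short to measure */
    }
    return ((double)len * (double)iters) / secs / 1e6;
}
```

To compare against a custom routine such as `my_copy`, the `memcpy` call would be swapped for it, with `src` pointing at the uncached mapping and `len` chosen as a multiple of 64.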
Can anyone help me figure out what I can do to make up for this performance gap?
I've tried a number of things like copying more at once, two loads followed by two stores. I could try prefetch just to prove that it's useless. Any other ideas?
Recommended answer
You can try using buffered memory rather than non-cached memory.