Replacing memcpy with NEON intrinsics

Question

I am trying to beat memcpy by writing the same copy with NEON intrinsics. Below is my logic:

uint8_t* m_input;  //Size as 400 x300
uint8_t* m_output; //Size as 400 x300
// buffer allocation omitted for brevity

memcpy(m_output, m_input, sizeof(m_output[0]) * 300* 400);

NEON:

int32_t ht_index,wd_index;
uint8x16_t vector8x16_image;

for(int32_t htI =0;htI < m_roiHeight;htI++){
    ht_index = htI * m_roiWidth ;

    for(int32_t wdI = 0;wdI < m_roiWidth;wdI+=16){
        wd_index = ht_index + wdI;
        // As posted: no wd_index offset, so every iteration loads the first 16 bytes of m_input.
        vector8x16_image = vld1q_u8(m_input);

        vst1q_u8(&m_output[wd_index],vector8x16_image);
    }
}

I verified these results multiple times on i.MX6 hardware.

Results:

memcpy:      0.039 ms
NEON memcpy: 0.02841 ms

I read somewhere that without preload instructions we cannot beat memcpy.

If that is true, how is my code producing these results? Is it right or wrong?

Recommended answer

If correctly written, a non-NEON memcpy() should be able to saturate the L3 bandwidth on your device, but for smaller transfers (fitting entirely within L1 or L2 cache) things can be different. Your test probably fits within L2 cache: each 400 x 300 buffer of uint8_t is 120,000 bytes, so source and destination together are about 240 KB.

Unfortunately memcpy has to work for any sized call, so it can't reasonably optimise for in-cache and out-of-cache cases at the same time as optimising for very short copies where the cost of detecting what kind of optimisation would be best turns out to be the dominant factor.

Even so, it's possible that your test isn't fair. You have to be sure that both implementations aren't subject to different cache preconditions or different virtual page layout.

Make sure neither test is run entirely before the other. Test some of one implementation, then test some of the other, then back to the first and back to the second a few times, to make sure they're not subject to any warm-up conditions. And use the same buffers for both to ensure that there's no characteristic of different parts of your virtual address space that harms one implementation only.
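
The interleaving described above can be sketched as a small C harness. This is a minimal sketch, not the measurement code from the question: `bench_interleaved`, `copy_memcpy`, `copy_candidate`, `now_ns`, and `sink` are hypothetical names, `clock_gettime(CLOCK_MONOTONIC)` is assumed to be available (POSIX), and `copy_candidate` is a plain 16-byte block loop standing in for whatever implementation is being compared.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Reading one copied byte into a volatile keeps the copies observable. */
static volatile uint8_t sink;

/* Baseline: the library memcpy. */
static void copy_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for the implementation under test (hypothetical):
 * 16-byte blocks, then a byte-wise tail. */
static void copy_candidate(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        memcpy(dst + i, src + i, 16);
    for (; i < n; ++i)
        dst[i] = src[i];
}

/* Monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Alternate a and b on the SAME buffers so that neither one runs
 * entirely before the other; report the mean time per call in ns. */
static void bench_interleaved(copy_fn a, copy_fn b,
                              uint8_t *dst, const uint8_t *src, size_t n,
                              int rounds, double *ns_a, double *ns_b)
{
    assert(rounds > 0 && n > 0);
    int64_t total_a = 0, total_b = 0;
    for (int r = 0; r < rounds; ++r) {
        int64_t t0 = now_ns();
        a(dst, src, n);
        sink = dst[n - 1];
        int64_t t1 = now_ns();
        b(dst, src, n);
        sink = dst[n - 1];
        int64_t t2 = now_ns();
        total_a += t1 - t0;
        total_b += t2 - t1;
    }
    *ns_a = (double)total_a / rounds;
    *ns_b = (double)total_b / rounds;
}
```

Calling `bench_interleaved` with the two 400 x 300 buffers from the question and several rounds gives per-call averages for both implementations measured under the same cache and page-layout conditions.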

Also, there are cases your memcpy doesn't handle, but these shouldn't matter much for large transfers.
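
One case the posted loop does not cover is a length that is not a multiple of 16 bytes; it also passes `m_input` to `vld1q_u8` without the `wd_index` offset. A minimal sketch of a block copy that indexes both buffers and finishes the tail byte by byte might look like this. `copy_blocks16` is a hypothetical name, the buffers are assumed not to overlap, and the non-NEON branch exists only so the sketch also compiles on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Copy n bytes from src to dst: 16-byte blocks first, then the
 * remaining 0..15 bytes one at a time. Buffers must not overlap. */
static void copy_blocks16(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Load from and store to the same offset in both buffers. */
        vst1q_u8(dst + i, vld1q_u8(src + i));
#else
        /* Portable fallback so the sketch builds without NEON. */
        memcpy(dst + i, src + i, 16);
#endif
    }
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

Because the whole image is one contiguous buffer, a single call such as `copy_blocks16(m_output, m_input, 300 * 400)` would replace the nested row and column loops.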
