Replacing memcpy with NEON intrinsics

Views: 106 | Published: 2021/4/9 19:26:16 | Tags: arm, simd, neon, cortex-a


Question

I am trying to beat memcpy by writing the same copy with NEON intrinsics. Below is my logic:

uint8_t* m_input;  // 400 x 300 bytes
uint8_t* m_output; // 400 x 300 bytes
// (buffer allocation omitted)

memcpy(m_output, m_input, sizeof(m_output[0]) * 300 * 400);

NEON:

int32_t ht_index, wd_index;
uint8x16_t vector8x16_image;

for (int32_t htI = 0; htI < m_roiHeight; htI++) {
    ht_index = htI * m_roiWidth;  // offset of the first byte of this row

    for (int32_t wdI = 0; wdI < m_roiWidth; wdI += 16) {
        wd_index = ht_index + wdI;
        // load 16 bytes from the same offset they are stored to
        vector8x16_image = vld1q_u8(&m_input[wd_index]);

        vst1q_u8(&m_output[wd_index], vector8x16_image);
    }
}

I verified these results multiple times on i.MX6 hardware.

Results:

memcpy:      0.039 milliseconds
NEON memcpy: 0.02841 milliseconds

I read somewhere that without preload instructions we cannot beat memcpy.

If that is true, how is my code producing these results? Are they right or wrong?

Recommended answer

If correctly written, a non-NEON memcpy() should be able to saturate the L3 bandwidth on your device, but for smaller transfers (fitting entirely within L1 or L2 cache) things can be different. Your test (400 x 300 = 120,000 bytes per buffer) probably fits within L2 cache.

Unfortunately, memcpy has to work for calls of any size, so it can't reasonably optimise for the in-cache and out-of-cache cases while also optimising for very short copies, where the cost of detecting which optimisation would be best turns out to be the dominant factor.

Even so, it's possible that your test isn't fair. You have to be sure that the two implementations aren't subject to different cache preconditions or different virtual page layouts.

Make sure neither test is run entirely before the other. Test some of one implementation, then some of the other, then go back to the first and back to the second a few times, to make sure they're not subject to any warm-up conditions. And use the same buffers for both, to ensure that no characteristic of different parts of your virtual address space harms one implementation only.
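The interleaving described above could look roughly like the following C sketch. It is only an illustration: the names copy_fn, copy_libc, copy_candidate and bench_interleaved are hypothetical, the candidate is a plain byte loop standing in for the NEON loop, and timing uses the standard clock() function (CPU time), which may be coarser than whatever timer the question used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical name). */
typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Reference implementation: the library memcpy. */
static void copy_libc(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for the candidate; the NEON loop would be swapped in here. */
static void copy_candidate(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* CPU time consumed so far, in seconds, via the standard clock(). */
static double now_sec(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/*
 * Alternate between the two routines for `rounds` rounds. In each round,
 * each routine runs `reps` times on the SAME buffers, and the time spent
 * is added to its running total: total[0] for fn[0], total[1] for fn[1].
 */
static void bench_interleaved(copy_fn fn[2], uint8_t *dst, const uint8_t *src,
                              size_t n, int rounds, int reps, double total[2])
{
    total[0] = total[1] = 0.0;
    for (int r = 0; r < rounds; r++) {
        for (int k = 0; k < 2; k++) {
            double t0 = now_sec();
            for (int i = 0; i < reps; i++)
                fn[k](dst, src, n);
            total[k] += now_sec() - t0;
        }
    }
}
```

Calling bench_interleaved with { copy_libc, copy_candidate }, one pair of 400 x 300 byte buffers, and a handful of rounds gives two totals measured under the same cache and page-layout conditions. Dividing each total by rounds * reps gives a per-copy time that can be compared directly.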

Also, there are cases your memcpy doesn't handle, but these shouldn't matter much for large transfers.
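The answer does not say which cases it means. One likely candidate, assumed here, is a length that is not a multiple of 16, which the 16-byte loop in the question would overrun. The sketch below handles that with a scalar tail loop. The function name copy_blocks16 is hypothetical, and the non-NEON branch exists only so the sketch also builds on machines without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/*
 * Copy n bytes from src to dst, 16 bytes at a time, then finish the
 * remaining (n % 16) bytes one by one. As with memcpy, the two buffers
 * must not overlap.
 */
static void copy_blocks16(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    /* Whole 16-byte blocks. */
    for (; i + 16 <= n; i += 16) {
#if HAVE_NEON
        /* One 128-bit load and one 128-bit store per iteration. */
        vst1q_u8(dst + i, vld1q_u8(src + i));
#else
        /* Portable fallback so the sketch builds without NEON. */
        memcpy(dst + i, src + i, 16);
#endif
    }

    /* Tail: the last 0..15 bytes that do not fill a whole vector. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

For the 400 x 300 image in the question the tail loop never runs, since 120,000 is a multiple of 16, which fits the remark that these cases matter little for large transfers.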
