内在函数中 Neon 的校验和代码实现 [英] Checksum code implementation for Neon in Intrinsics

查看:27
本文介绍了内在函数中 Neon 的校验和代码实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用内在函数为 NEON 实现校验和计算代码(2 的补码加法).当前校验和计算正在 ARM 上进行.

I'm trying to implement the checksum computation code(2's complement addition) for NEON, using intrinsic. The current checksum computation is being carried out on ARM.

我的实现一次从内存中取出 128 位到 NEON 寄存器中并进行 SIMD(加法),结果从 128 位数字折叠为 16 位数字.

My implementation fetches 128-bits at once from the memory into NEON registers and does SIMD (addition), and result is folded to a 16-bit number from a 128-bit number.

看起来一切正常,但我的 NEON 实现比 ARM 版本花费更多的时间.

Everything looks to be working fine, but my NEON implementation is consuming more time that of the ARM version.

ARM 版本需要:0.860000 秒NEON 版本需要:1.260000 秒

ARM version takes: 0.860000 s NEON version takes: 1.260000 s

注意:

  1. 使用time.h"中的实用程序进行分析
  2. 从示例应用程序中调用了 10,000 次校验和函数,并在所有函数运行完成后计算时间

其他详情:

  1. 使用 GNU 工具链 (arm-none-linux-gnueabi-gcc) 来编译内部代码,而不是 arm 工具链.
  2. Linux 平台.
  3. C 内在代码.

问题:

  1. 为什么 NEON 版本比 ARM 版本需要更多时间?(虽然我已经注意在批处理中使用了最小周期的内在)

  1. Why does NEON version take more time than that of the ARM version? (Although I have taken care that intrinsic with minimum cycles in the batch is used)

如何实现我想要实现的目标?(NEON 的效率)

How do achieve what I want to achieve? (efficiency with NEON)

有人可以指点我或分享一些使用 ARM-NEON 互操作的示例实现(伪代码/算法/代码,而不是理论实现论文或演讲)吗?

Could someone point to me or share some sample implementations(pseudo-code/algorithms/code, not the theoretical implementation papers or talks) which uses the inter-operations of ARM-NEON together?

任何帮助将不胜感激.

这是我的代码:

uint16_t do_csum(const unsigned char * buff, int len)
{
int odd, count, i;

uint32x4_t result = veorq_u32( result, result), sum = veorq_u32( sum, sum); 
uint16x4_t data, data_hi, data_low, data8;
uint16x8_t dataq;
uint16_t result16, disp[20] = {0,0,0,0,0,0,0,0,0,0};

if (len <= 0)
    goto out;
odd = 1 & (unsigned long) buff;
if (odd) {
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t)vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    data1 = (uint16x4_t)vshl_n_u16( data1, 8);

    len--;
    buff++;
    result = vaddw_u16(result, data1);
}
count = len >> 1;       /* nr of 16-bit words.. */
if (count) {
    if (2 & (unsigned long) buff) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        count--;
        len -= 2;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
    count >>= 1;        /* nr of 32-bit words.. */
    if (count) {
        if (4 & (unsigned long) buff) {
            uint32x2_t data4 = (uint16x4_t) vld1_lane_u32((uint32_t *) buff, data4, 0);
            count--;
            len -= 4;
            buff += 4;
            result = vaddw_u16( result, data4);
        }
        count >>= 1;    /* nr of 64-bit words.. */
        if (count) {
            if (8 & (unsigned long) buff) {
                uint64x1_t data8 = vld1_u64((uint64_t *) buff); 
                count--;
                len -= 8;
                buff += 8;
                result = vaddw_u16( result,(uint16x4_t)data8);
            }
            count >>= 1;    /* nr of 128-bit words.. */
            if (count) {
                do {
                    dataq = (uint16x8_t)vld1q_u64((uint64_t *) buff); // VLD1.64 {d0, d1}, [r0]
                    count--;
                    buff += 16;

                    sum = vpaddlq_u16(dataq);   
                    vst1q_u16( disp, dataq); // VST1.16 {d0, d1}, [r0]

                    result = vaddq_u32( sum, result);
                } while (count);
            }
            if (len & 8) {
                uint64x1_t data8 =  vld1_u64((uint64_t *) buff); 
                buff += 8;
                result = vaddw_u16( result, (uint16x4_t)data8);
            }
        }
        if (len & 4) {
            uint32x2_t data4 = veor_u32( data4, data4); 

            data4 = (uint16x4_t)vld1_lane_u32((uint32_t *) buff, data4, 0);//result += *(unsigned int *) buff;
            buff += 4;
            result = vaddw_u16( result,(uint16x4_t) data4);
        }
    }
    if (len & 2) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
}
if (len & 1){
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t) vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    result = vaddw_u8( result, data1);
}


result16 = from128to16(result);

if (odd)
    result16 = ((result16 >> 8) & 0xff) | ((result16 & 0xff) << 8);

out:
    return result16;
}

推荐答案

一些可以改进的地方:

  • 摆脱存储到 disp - 这看起来像是留在了调试代码中?
  • 不要在主循环中进行水平加法 - 只需在循环中进行部分(垂直)求和,然后在循环后进行最后一次水平加法(请参阅 this answer 举例说明如何执行此操作 - 适用于 SSE,但原理相同)
  • 确保您使用 gcc -O3 ... 以从编译器优化中获得最大收益
  • 不要使用 goto !(不影响性能但很邪恶.)
  • Get rid of the stores to disp - this looks like debug code that got left in ?
  • Don't do horizontal addition within your main loop - just do partial (vertical) sums in the loop and do one final horizontal addition after the loop (see this answer for an example of how to do this - it's for SSE but the principle is the same)
  • Make sure you use gcc -O3 ... to get maximum benefit from compiler optimisation
  • Don't use goto ! (Doesn't affect performance but is evil.)

这篇关于内在函数中 Neon 的校验和代码实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆