Code works but is slow

Problem description

Any suggestions on how to improve this code to make it faster? How would I rewrite the following function as inline assembly?

void tomSimd::calculations(void* btr)
{
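    // btr must point to a 4x4 block of 16 contiguous shorts, 8-byte aligned.
    // __declspec(align(8)) on a pointer variable aligns the pointer itself,
    // not the data it points to, so it is dropped here.
    __m64* block1 = (__m64*)btr;
    __m64 s0, s1, s2, s3, f0, f1, f2, f3, temp4, temp5, temp6, temp7;
    const int j = 0;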
    
    // transpose input
    temp4 = _mm_unpacklo_pi16(block1[j],block1[j+1]);
    temp5 = _mm_unpacklo_pi16(block1[j+2],block1[j+3]);
    temp6 = _mm_unpackhi_pi16(block1[j],block1[j+1]);
    temp7 = _mm_unpackhi_pi16(block1[j+2],block1[j+3]);
    f0 = _mm_unpacklo_pi32(temp4,temp5);
    f2 = _mm_unpacklo_pi32(temp6,temp7);
    f1 = _mm_unpackhi_pi32(temp4,temp5);
    f3 = _mm_unpackhi_pi32(temp6,temp7);
    
    // stage one
    s0 =_mm_add_pi16(f0,f3);
    s3 =_mm_sub_pi16(f0,f3);
    s1 =_mm_add_pi16(f1,f2);
    s2 =_mm_sub_pi16(f1,f2);
    
    //stage 2
    block1[j] =_mm_add_pi16(s0,s1);
    block1[j+2] =_mm_sub_pi16(s0,s1);
    block1[j+1] =_mm_add_pi16(s2,_mm_slli_pi16(s3, 1));
    block1[j+3] =_mm_sub_pi16(s3,_mm_slli_pi16(s2, 1));
    
    _mm_empty();
}
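
For anyone checking an optimized rewrite against the function above, a plain scalar version can serve as a reference. The sketch below is not part of the original question: the name calculationsScalar is hypothetical, and it assumes the buffer holds a row-major 4x4 block of 16-bit values whose intermediate sums stay within 16 bits.

```cpp
#include <cstring>

// Hypothetical scalar reference for the MMX routine above (not from the
// original post). Assumes `block` points to 16 shorts, row-major 4x4.
void calculationsScalar(short* block)
{
    short out[16];
    for (int i = 0; i < 4; ++i) {
        const short* row = block + 4 * i;  // input row i

        // stage one: butterflies on the four values of row i
        const int s0 = row[0] + row[3];
        const int s3 = row[0] - row[3];
        const int s1 = row[1] + row[2];
        const int s2 = row[1] - row[2];

        // stage two: the results for input row i land in column i of the
        // output, which is the effect of the transpose in the MMX version
        out[0 * 4 + i] = static_cast<short>(s0 + s1);
        out[1 * 4 + i] = static_cast<short>(s2 + 2 * s3);
        out[2 * 4 + i] = static_cast<short>(s0 - s1);
        out[3 * 4 + i] = static_cast<short>(s3 - 2 * s2);
    }
    std::memcpy(block, out, sizeof out);
}
```

Running both versions on the same input and comparing the 16 outputs is a cheap way to confirm that a faster variant still computes the same thing.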

Recommended answer

You can make it faster by not casting and not using a void* pointer. As written, the code is not easy for the compiler to optimize.


_mm_empty() (assembly instruction: emms) is an expensive instruction that takes quite a few cycles. If you call this function in a loop, consider moving that loop into this method so that emms runs only once, after all the work is done (as long as no floating-point instructions run in between).
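
One way to apply this advice is sketched below. It is an illustration under assumptions rather than a drop-in replacement: transformBlock and calculationsBatch are hypothetical names, the blocks are assumed to be stored back to back as 16 shorts each in an 8-byte-aligned buffer, and the code needs an x86 compiler that provides mmintrin.h. It takes a typed short* instead of void*, and calls _mm_empty() once after the loop.

```cpp
#include <mmintrin.h>  // MMX intrinsics (x86 only)
#include <cstddef>

// Hypothetical helper: transforms one 4x4 block in place.
// It deliberately does NOT call _mm_empty().
static inline void transformBlock(__m64* b)
{
    // transpose input
    const __m64 t4 = _mm_unpacklo_pi16(b[0], b[1]);
    const __m64 t5 = _mm_unpacklo_pi16(b[2], b[3]);
    const __m64 t6 = _mm_unpackhi_pi16(b[0], b[1]);
    const __m64 t7 = _mm_unpackhi_pi16(b[2], b[3]);
    const __m64 f0 = _mm_unpacklo_pi32(t4, t5);
    const __m64 f1 = _mm_unpackhi_pi32(t4, t5);
    const __m64 f2 = _mm_unpacklo_pi32(t6, t7);
    const __m64 f3 = _mm_unpackhi_pi32(t6, t7);

    // stage one
    const __m64 s0 = _mm_add_pi16(f0, f3);
    const __m64 s3 = _mm_sub_pi16(f0, f3);
    const __m64 s1 = _mm_add_pi16(f1, f2);
    const __m64 s2 = _mm_sub_pi16(f1, f2);

    // stage two
    b[0] = _mm_add_pi16(s0, s1);
    b[2] = _mm_sub_pi16(s0, s1);
    b[1] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    b[3] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
}

// Hypothetical batch entry point: `blocks` holds `count` consecutive
// 4x4 blocks (16 shorts each) and is assumed to be 8-byte aligned.
void calculationsBatch(short* blocks, std::size_t count)
{
    __m64* p = reinterpret_cast<__m64*>(blocks);
    for (std::size_t i = 0; i < count; ++i)
        transformBlock(p + 4 * i);
    _mm_empty();  // emms runs once, after the last block
}
```

The caller must still avoid x87 floating-point code between the first block and the final _mm_empty(), as the answer notes.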

Good luck!

