MMX code running slower than C++ code
Problem description
Why does my MMX code run slower than the C++ code (the commented-out block, shown in green)? The results are the same; only the speed differs.
void tom::add(void* btr)
{
    __declspec(align(8)) short* b = (short*)btr;
    int j;
    /*
    for(j = 0; j < 4; j++)
    {
        /// 1st stage add.
        int s0 = (int)(b[j] + b[j+3]);
        int s3 = (int)(b[j] - b[j+3]);
        int s1 = (int)(b[j+1] + b[j+2]);
        int s2 = (int)(b[j+1] - b[j+2]);
        /// 2nd stage add.
        b[j] = (short)(s0 + s1);
        b[j+8] = (short)(s0 - s1);
        b[j+4] = (short)(s2 + (s3 << 1));
        b[j+12] = (short)(s3 - (s2 << 1));
    }//end for j...
    */
    __m64* b1 = (__m64*)b;
    j = 0;
    __m64 f0 = _mm_set_pi16(b[j+12], b[j+8], b[j+4], b[j]);
    __m64 f1 = _mm_set_pi16(b[j+13], b[j+9], b[j+5], b[j+1]);
    __m64 f2 = _mm_set_pi16(b[j+14], b[j+10], b[j+6], b[j+2]);
    __m64 f3 = _mm_set_pi16(b[j+15], b[j+11], b[j+7], b[j+3]);
    for(j = 0; j < 4; j += 4)
    {
        // stage one add
        __m64 s0 = _mm_add_pi16(f0, f3);
        __m64 s3 = _mm_sub_pi16(f0, f3);
        __m64 s1 = _mm_add_pi16(f1, f2);
        __m64 s2 = _mm_sub_pi16(f1, f2);
        // stage two add
        *(&b1[j]) = _mm_add_pi16(s0, s1);
        *(&b1[j+2]) = _mm_sub_pi16(s0, s1);
        *(&b1[j+1]) = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
        *(&b1[j+3]) = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    }
    _mm_empty();
}
Recommended answer
In your MMX version you have a loop declared as for(j = 0; j < 4; j += 4); it is not needed because it executes exactly once, so you can remove it and assume j=0.
Are you compiling with optimizations turned on and/or with SSE instructions enabled? Generally speaking, the optimizations made by the compiler tend to be much better than code you write by hand with intrinsics or assembly.
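To make the suggestion concrete, here is a minimal sketch of the intrinsic routine with the one-pass loop removed, next to a plain scalar version that the compiler is free to optimize on its own. The free-function names add_block_mmx and add_block_scalar are hypothetical (the original is a member function, tom::add), and the scalar routine is my reading of what the intrinsic code computes, not the commented-out loop from the question. The fallback branch is also an assumption, added only so the sketch still builds on targets without MMX.

```cpp
#include <cstring>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define SKETCH_HAVE_MMX 1
#endif

// Scalar reference (hypothetical name): treats b as a 4x4 block of shorts,
// applies the two add stages to each row, and stores the four results of
// row r at b[r], b[4+r], b[8+r], b[12+r], matching the intrinsic version.
void add_block_scalar(short* b)
{
    short out[16];
    for (int r = 0; r < 4; ++r) {
        const int a0 = b[4 * r], a1 = b[4 * r + 1];
        const int a2 = b[4 * r + 2], a3 = b[4 * r + 3];
        // stage one add
        const int s0 = a0 + a3, s3 = a0 - a3;
        const int s1 = a1 + a2, s2 = a1 - a2;
        // stage two add (multiply by 2 instead of shifting a signed value)
        out[r]      = static_cast<short>(s0 + s1);
        out[4 + r]  = static_cast<short>(s2 + 2 * s3);
        out[8 + r]  = static_cast<short>(s0 - s1);
        out[12 + r] = static_cast<short>(s3 - 2 * s2);
    }
    std::memcpy(b, out, sizeof out);
}

// Intrinsic version (hypothetical name) with the for loop dropped and j
// fixed at 0. b is assumed to point to 16 shorts aligned to 8 bytes.
void add_block_mmx(short* b)
{
#ifdef SKETCH_HAVE_MMX
    // Gather the four columns; lane r of fN holds b[4*r + N].
    __m64 f0 = _mm_set_pi16(b[12], b[8],  b[4], b[0]);
    __m64 f1 = _mm_set_pi16(b[13], b[9],  b[5], b[1]);
    __m64 f2 = _mm_set_pi16(b[14], b[10], b[6], b[2]);
    __m64 f3 = _mm_set_pi16(b[15], b[11], b[7], b[3]);
    // stage one add
    __m64 s0 = _mm_add_pi16(f0, f3);
    __m64 s3 = _mm_sub_pi16(f0, f3);
    __m64 s1 = _mm_add_pi16(f1, f2);
    __m64 s2 = _mm_sub_pi16(f1, f2);
    // stage two add, written straight to the four 64-bit slots
    __m64* out = reinterpret_cast<__m64*>(b);
    out[0] = _mm_add_pi16(s0, s1);
    out[1] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    out[2] = _mm_sub_pi16(s0, s1);
    out[3] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    _mm_empty();
#else
    add_block_scalar(b);  // assumption: fallback for non-MMX targets
#endif
}
```

Timing both functions in an optimized build on the same input is one way to check the second point of the answer. Note that the four _mm_set_pi16 calls gather sixteen values one element at a time, so the intrinsic path spends much of its work just arranging data before the eight packed arithmetic operations run.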