Optimizing SSE-code
Question
I'm currently developing a C-module for a Java-application that needs some performance improvements (see Improving performance of network coding-encoding for a background). I've tried to optimize the code using SSE-intrinsics and it executes somewhat faster than the Java-version (~20%). However, it's still not fast enough.
Unfortunately my experience with optimizing C-code is somewhat limited. I therefore would love to get some ideas on how to improve the current implementation.
The inner loop that constitutes the hot-spot looks like this:
for (i = 0; i < numberOfGFVectorsInFragment; i++) {
    // Load the 4 GF-elements from the message-fragment and add the log of the coefficient to them.
    __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
    __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
    __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

    // Gather the 4 exp-table entries selected by logSumVector and pack them with _mm_set_epi32.
    int* logSums = (int*)(&logSumVector);
    __m128i valuesToXor = _mm_set_epi32(expTable[logSums[3]], expTable[logSums[2]],
                                        expTable[logSums[1]], expTable[logSums[0]]);

    __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
    _mm_store_si128(encodedFragmentResultArray, updatedResultVector);
    encodedFragmentResultArray++;
    currentMessageFragmentPtr++;
}
Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather memory access and the _mm_set_epi32 packing operations. Internally, _mm_set_epi32 will in your case probably be implemented as a series of unpacklo/hi instructions.
Most of the "work" in this loop is the packing of these 4 memory accesses. In the absence of SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled.
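As a sketch of that suggestion, a scalar, unrolled-by-four version of the loop could look like this. The function name and signature are invented for illustration; the variable names follow the question, and expTable is assumed to be a table of 32-bit values:

```c
#include <stdint.h>

/* Scalar sketch of the same loop. numberOfGFVectorsInFragment counts
   groups of four 32-bit GF-elements (one former __m128i per group).
   Unrolled by four so the independent table lookups can overlap. */
void encodeFragmentScalar(uint32_t *result, const uint32_t *message,
                          const uint32_t *expTable, uint32_t coefficientLog,
                          int numberOfGFVectorsInFragment)
{
    for (int i = 0; i < numberOfGFVectorsInFragment; i++) {
        result[0] ^= expTable[message[0] + coefficientLog];
        result[1] ^= expTable[message[1] + coefficientLog];
        result[2] ^= expTable[message[2] + coefficientLog];
        result[3] ^= expTable[message[3] + coefficientLog];
        result  += 4;
        message += 4;
    }
}
```

The point is that the four loads, adds, table lookups, and XORs are fully independent, so an out-of-order core can run them in parallel without any pack/unpack overhead.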
If you're willing to use SSE4.1, you can try this. It might be faster, it might not:
// Build valuesToXor with a scalar load plus three PINSRD inserts
// instead of _mm_set_epi32.
int* logSumArray = (int*)(&logSumVector);
__m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);
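Slotted into the original loop, the insert-based gather might look like the following sketch. The function name and signature are made up for illustration; all other names follow the question (compile with SSE4.1 support, e.g. -msse4.1 or the target attribute used here):

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical wrapper around the question's loop, with the gather step
   rewritten to use _mm_cvtsi32_si128 + _mm_insert_epi32 (PINSRD). */
__attribute__((target("sse4.1")))
void encodeFragmentSSE41(__m128i *encodedFragmentResultArray,
                         const __m128i *currentMessageFragmentPtr,
                         const int *expTable,
                         __m128i coefficientLogValueVector,
                         int numberOfGFVectorsInFragment)
{
    for (int i = 0; i < numberOfGFVectorsInFragment; i++) {
        __m128i msg = _mm_load_si128(currentMessageFragmentPtr);
        __m128i enc = _mm_load_si128(encodedFragmentResultArray);
        __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, msg);

        /* Scalar table loads, re-packed with PINSRD instead of _mm_set_epi32. */
        int *logSumArray = (int *)&logSumVector;
        __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
        valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
        valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
        valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);

        _mm_store_si128(encodedFragmentResultArray,
                        _mm_xor_si128(enc, valuesToXor));
        encodedFragmentResultArray++;
        currentMessageFragmentPtr++;
    }
}
```

Note that spilling logSumVector through an int pointer forces a round-trip through memory; that store-to-load forwarding latency is part of why this version may or may not beat the original.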
I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions to give this code any chance of performing well.
What you really need is Intel's AVX2 gather instructions. But that's a few years down the road...
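For reference, here is a sketch of what the gather step could look like with AVX2's VPGATHERDD. The function name and signature are hypothetical; the variable names follow the question. The four expTable loads selected by logSumVector collapse into a single gather instruction, eliminating the packing entirely:

```c
#include <immintrin.h>  /* AVX2 intrinsics */

/* Hypothetical AVX2 rewrite of the gather + XOR step of the loop body.
   Requires an AVX2-capable CPU at runtime. */
__attribute__((target("avx2")))
__m128i gatherXorStep(__m128i currentEncodedResult, __m128i logSumVector,
                      const int *expTable)
{
    /* scale = 4: the indices in logSumVector are in units of 32-bit ints */
    __m128i valuesToXor = _mm_i32gather_epi32(expTable, logSumVector, 4);
    return _mm_xor_si128(currentEncodedResult, valuesToXor);
}
```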