Optimizing SSE-code
Question
I'm currently developing a C-module for a Java-application that needs some performance improvements (see Improving performance of network coding-encoding for a background). I've tried to optimize the code using SSE-intrinsics and it executes somewhat faster than the Java-version (~20%). However, it's still not fast enough.
Unfortunately my experience with optimizing C-code is somewhat limited. I therefore would love to get some ideas on how to improve the current implementation.
The inner loop that constitutes the hot-spot looks like this:
for (i = 0; i < numberOfGFVectorsInFragment; i++) {
    // Load the 4 GF-elements from the message-fragment and add the log of the coefficient to them.
    __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
    __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
    __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

    // Gather the 4 exp-table entries selected by logSumVector and pack them with _mm_set_epi32.
    int* logSums = (int*)(&logSumVector);
    __m128i valuesToXor = _mm_set_epi32(expTable[logSums[3]], expTable[logSums[2]],
                                        expTable[logSums[1]], expTable[logSums[0]]);

    __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
    _mm_store_si128(encodedFragmentResultArray, updatedResultVector);
    encodedFragmentResultArray++;
    currentMessageFragmentPtr++;
}
Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather memory access and the _mm_set_epi32 packing operations. Internally, _mm_set_epi32 will in your case probably be implemented as a series of unpacklo/hi instructions.
Most of the "work" in this loop is the packing of these 4 memory accesses. In the absence of SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled.
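As a sketch of that suggestion, a scalar, unrolled-by-four version of the loop could look like this. The function name and signature are invented for illustration; the variable names follow the question, and expTable is assumed to be a table of 32-bit values:

```c
#include <stdint.h>

/* Scalar sketch of the same loop. numberOfGFVectorsInFragment counts
   groups of four 32-bit GF-elements (one former __m128i per group).
   Unrolled by four so the independent table lookups can overlap. */
void encodeFragmentScalar(uint32_t *result, const uint32_t *message,
                          const uint32_t *expTable, uint32_t coefficientLog,
                          int numberOfGFVectorsInFragment)
{
    for (int i = 0; i < numberOfGFVectorsInFragment; i++) {
        result[0] ^= expTable[message[0] + coefficientLog];
        result[1] ^= expTable[message[1] + coefficientLog];
        result[2] ^= expTable[message[2] + coefficientLog];
        result[3] ^= expTable[message[3] + coefficientLog];
        result  += 4;
        message += 4;
    }
}
```

The point is that the four loads, adds, table lookups, and XORs are fully independent, so an out-of-order core can run them in parallel without any pack/unpack overhead.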
If you're willing to use SSE4.1, you can try this. It might be faster, it might not:
// Build valuesToXor with a scalar load plus three PINSRD inserts
// instead of _mm_set_epi32.
int* logSumArray = (int*)(&logSumVector);
__m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);
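Slotted into the original loop, the insert-based gather might look like the following sketch. The function name and signature are made up for illustration; all other names follow the question (compile with SSE4.1 support, e.g. -msse4.1 or the target attribute used here):

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical wrapper around the question's loop, with the gather step
   rewritten to use _mm_cvtsi32_si128 + _mm_insert_epi32 (PINSRD). */
__attribute__((target("sse4.1")))
void encodeFragmentSSE41(__m128i *encodedFragmentResultArray,
                         const __m128i *currentMessageFragmentPtr,
                         const int *expTable,
                         __m128i coefficientLogValueVector,
                         int numberOfGFVectorsInFragment)
{
    for (int i = 0; i < numberOfGFVectorsInFragment; i++) {
        __m128i msg = _mm_load_si128(currentMessageFragmentPtr);
        __m128i enc = _mm_load_si128(encodedFragmentResultArray);
        __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, msg);

        /* Scalar table loads, re-packed with PINSRD instead of _mm_set_epi32. */
        int *logSumArray = (int *)&logSumVector;
        __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
        valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
        valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
        valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);

        _mm_store_si128(encodedFragmentResultArray,
                        _mm_xor_si128(enc, valuesToXor));
        encodedFragmentResultArray++;
        currentMessageFragmentPtr++;
    }
}
```

Note that spilling logSumVector through an int pointer forces a round-trip through memory; that store-to-load forwarding latency is part of why this version may or may not beat the original.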
I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions to give this code any chance of performing well.
What you really need is Intel's AVX2 gather instructions. But that's a few years down the road...
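For reference, here is a sketch of what the gather step could look like with AVX2's VPGATHERDD. The function name and signature are hypothetical; the variable names follow the question. The four expTable loads selected by logSumVector collapse into a single gather instruction, eliminating the packing entirely:

```c
#include <immintrin.h>  /* AVX2 intrinsics */

/* Hypothetical AVX2 rewrite of the gather + XOR step of the loop body.
   Requires an AVX2-capable CPU at runtime. */
__attribute__((target("avx2")))
__m128i gatherXorStep(__m128i currentEncodedResult, __m128i logSumVector,
                      const int *expTable)
{
    /* scale = 4: the indices in logSumVector are in units of 32-bit ints */
    __m128i valuesToXor = _mm_i32gather_epi32(expTable, logSumVector, 4);
    return _mm_xor_si128(currentEncodedResult, valuesToXor);
}
```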