Optimizing SSE code

Question

I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough.

Unfortunately, my experience with optimizing C code is somewhat limited. I would therefore love to get some ideas on how to improve the current implementation.

The inner loop that constitutes the hot-spot looks like this:

    for (i = 0; i < numberOfGFVectorsInFragment; i++) {

        // Load the 4 GF-elements from the message fragment and add the log of the coefficient to them.
        __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
        __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);

        __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

        // Gather the 4 exp-table entries for the log sums and pack them into one vector.
        int* logSumArray = (int*)(&logSumVector);
        __m128i valuesToXor = _mm_set_epi32(expTable[logSumArray[3]], expTable[logSumArray[2]],
                                            expTable[logSumArray[1]], expTable[logSumArray[0]]);

        // XOR the gathered values into the encoded result and store it back.
        __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
        _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

        encodedFragmentResultArray++;
        currentMessageFragmentPtr++;
    }

Solution

Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather memory access and the _mm_set_epi32 packing operations. Internally, _mm_set_epi32 will in your case probably be implemented as a series of unpacklo/hi instructions.
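For illustration, here is roughly what that lowering looks like, written out as intrinsics (a sketch; actual code generation varies by compiler):

    #include <emmintrin.h>  /* SSE2 */

    /* Roughly the fallback a compiler may emit for _mm_set_epi32(d, c, b, a):
       four scalar-to-vector moves followed by three unpack steps. */
    static inline __m128i set_epi32_sketch(int d, int c, int b, int a)
    {
        __m128i va = _mm_cvtsi32_si128(a);        /* a 0 0 0 */
        __m128i vb = _mm_cvtsi32_si128(b);        /* b 0 0 0 */
        __m128i vc = _mm_cvtsi32_si128(c);        /* c 0 0 0 */
        __m128i vd = _mm_cvtsi32_si128(d);        /* d 0 0 0 */
        __m128i ab = _mm_unpacklo_epi32(va, vb);  /* a b 0 0 */
        __m128i cd = _mm_unpacklo_epi32(vc, vd);  /* c d 0 0 */
        return _mm_unpacklo_epi64(ab, cd);        /* a b c d */
    }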

Most of the "work" in this loop is packing these 4 memory accesses. In the absence of SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled.
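A minimal sketch of that non-vectorized-but-unrolled alternative (assuming the arrays hold 32-bit elements, and using a hypothetical scalar coefficientLogValue for the value broadcast in coefficientLogValueVector):

    /* Sketch: the same update done with plain scalar code, unrolled 4x.
       coefficientLogValue is a hypothetical scalar holding the value
       broadcast in coefficientLogValueVector. */
    const int* msg = (const int*)currentMessageFragmentPtr;
    int* res = (int*)encodedFragmentResultArray;
    for (i = 0; i < numberOfGFVectorsInFragment; i++) {
        res[0] ^= expTable[coefficientLogValue + msg[0]];
        res[1] ^= expTable[coefficientLogValue + msg[1]];
        res[2] ^= expTable[coefficientLogValue + msg[2]];
        res[3] ^= expTable[coefficientLogValue + msg[3]];
        res += 4;
        msg += 4;
    }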

If you're willing to use SSE4.1, you can try this. It might be faster, it might not:

    // Build the gather one lane at a time: lane 0 via movd, lanes 1-3 via
    // SSE4.1 pinsrd, reading the log sums back through an int pointer.
    int* logSumArray = (int*)(&logSumVector);

    __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);

I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions, as sketched below, to give this code any chance of performing well.
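A sketch of what that unrolled, interleaved structure could look like with the SSE4.1 variant (only 2x here for brevity; names like m0/s0/x0 are illustrative, and remainder handling is omitted):

    /* Sketch: process two vectors per iteration and interleave the two
       independent gather chains so their load/insert latencies overlap.
       Requires SSE4.1 (#include <smmintrin.h>). */
    for (i = 0; i + 1 < numberOfGFVectorsInFragment; i += 2) {
        __m128i m0 = _mm_load_si128(currentMessageFragmentPtr);
        __m128i m1 = _mm_load_si128(currentMessageFragmentPtr + 1);
        __m128i s0 = _mm_add_epi32(coefficientLogValueVector, m0);
        __m128i s1 = _mm_add_epi32(coefficientLogValueVector, m1);
        int* l0 = (int*)&s0;
        int* l1 = (int*)&s1;

        /* Interleave the two gather chains lane by lane. */
        __m128i x0 = _mm_cvtsi32_si128(expTable[l0[0]]);
        __m128i x1 = _mm_cvtsi32_si128(expTable[l1[0]]);
        x0 = _mm_insert_epi32(x0, expTable[l0[1]], 1);
        x1 = _mm_insert_epi32(x1, expTable[l1[1]], 1);
        x0 = _mm_insert_epi32(x0, expTable[l0[2]], 2);
        x1 = _mm_insert_epi32(x1, expTable[l1[2]], 2);
        x0 = _mm_insert_epi32(x0, expTable[l0[3]], 3);
        x1 = _mm_insert_epi32(x1, expTable[l1[3]], 3);

        __m128i r0 = _mm_xor_si128(_mm_load_si128(encodedFragmentResultArray), x0);
        __m128i r1 = _mm_xor_si128(_mm_load_si128(encodedFragmentResultArray + 1), x1);
        _mm_store_si128(encodedFragmentResultArray, r0);
        _mm_store_si128(encodedFragmentResultArray + 1, r1);

        encodedFragmentResultArray += 2;
        currentMessageFragmentPtr += 2;
    }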

What you really need is Intel's AVX2 gather/scatter instructions. But that's a few years down the road...
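For reference, on AVX2 the whole lane-by-lane gather would collapse into a single intrinsic (a sketch, assuming expTable holds 32-bit ints; AVX2 provides gather only, but the stores in this loop are contiguous, so no scatter is needed anyway):

    /* Sketch (AVX2, #include <immintrin.h>): one vpgatherdd fetches all four
       expTable entries at once, using the lanes of logSumVector as indices.
       Scale 4 = sizeof(int). */
    __m128i valuesToXor = _mm_i32gather_epi32((const int*)expTable, logSumVector, 4);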
