Stack usage with MMX intrinsics and Microsoft C++

Question

I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the eight 64-bit MMX registers can together hold 16 int32s to calculate 16 different cumulative sums in parallel.

I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use all 8 MMX registers to accumulate the 16 independent sums.

Can anybody comment on this and maybe propose a solution on how to convert the piece of code below to use intrinsics?

== inline assembler (only part within the loop) ==

paddd   mm0, [esi+edx+8*0]  ; add 1st & 2nd int32 elements (one pair per register)
paddd   mm1, [esi+edx+8*1]  ; add 3rd & 4th int32 elements ...
paddd   mm2, [esi+edx+8*2]
paddd   mm3, [esi+edx+8*3]
paddd   mm4, [esi+edx+8*4]
paddd   mm5, [esi+edx+8*5]
paddd   mm6, [esi+edx+8*6]
paddd   mm7, [esi+edx+8*7]  ; add 15th & 16th int32 elements


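  • esi points to the beginning of the data array.

  • edx provides the offset into the data array for the current loop iteration.

  • The data array is arranged such that the elements for the 16 independent sums are interleaved.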
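
To make the interleaved layout concrete, here is a minimal scalar sketch of what the loop computes. The function name `sumInterleavedScalar` and the `count` parameter (number of 16-element groups) are assumptions made for illustration and do not appear in the original post; the sketch also produces only the final totals, not intermediate values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper (not from the original post): element i of
// the interleaved array belongs to running sum number (i % 16).
// `count` is the number of 16-element groups in `data`.
std::array<std::int32_t, 16> sumInterleavedScalar(const std::int32_t *data,
                                                  std::size_t count)
{
    assert(data != nullptr || count == 0);

    // Accumulate in unsigned arithmetic so that overflow wraps around the
    // way paddd does, instead of being undefined behaviour.
    std::array<std::uint32_t, 16> acc{};
    for (std::size_t group = 0; group < count; ++group) {
        for (std::size_t lane = 0; lane < 16; ++lane) {
            acc[lane] += static_cast<std::uint32_t>(data[16 * group + lane]);
        }
    }

    // Convert back to signed values for the caller.
    std::array<std::int32_t, 16> sums{};
    for (std::size_t lane = 0; lane < 16; ++lane) {
        sums[lane] = static_cast<std::int32_t>(acc[lane]);
    }
    return sums;
}
```

A helper like this can serve as a correctness baseline when comparing the inline-assembler and intrinsic versions.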

Recommended answer

    VS2010 does a decent optimization job on the equivalent code using intrinsics. In most cases, it compiles the intrinsic:

    sum = _mm_add_pi32(sum, *(__m64 *) &intArray[i + offset]);
    

    into something like:

    movq  mm0, mmword ptr [eax+8*offset]
    paddd mm1, mm0
    

    This isn't as concise as your paddd mm1, [esi+edx+8*offset], but it arguably comes pretty close. The execution time is likely dominated by the memory fetch.

    The catch is that VS seems to add MMX registers only to other MMX registers, rather than folding the load into paddd as a memory operand. One register (mm0 in the listing above) is therefore tied up as a load temporary, so the scheme works only for the first 7 sums. The 8th sum requires that some register be saved temporarily to memory, that is, spilled to the stack.

    Here's a complete program and its corresponding compiled assembly (release build):

    #include <stdio.h>
    #include <stdlib.h>
    #include <xmmintrin.h>
    
    void addWithInterleavedIntrinsics(int *interleaved, int count)
    {
        // sum up the numbers
        __m64 sum0 = _mm_setzero_si64(), sum1 = _mm_setzero_si64(),
              sum2 = _mm_setzero_si64(), sum3 = _mm_setzero_si64(),
              sum4 = _mm_setzero_si64(), sum5 = _mm_setzero_si64(),
              sum6 = _mm_setzero_si64(), sum7 = _mm_setzero_si64();
    
        for (int i = 0; i < 16 * count; i += 16) {
            sum0 = _mm_add_pi32(sum0, *(__m64 *) &interleaved[i]);
            sum1 = _mm_add_pi32(sum1, *(__m64 *) &interleaved[i + 2]);
            sum2 = _mm_add_pi32(sum2, *(__m64 *) &interleaved[i + 4]);
            sum3 = _mm_add_pi32(sum3, *(__m64 *) &interleaved[i + 6]);
            sum4 = _mm_add_pi32(sum4, *(__m64 *) &interleaved[i + 8]);
            sum5 = _mm_add_pi32(sum5, *(__m64 *) &interleaved[i + 10]);
            sum6 = _mm_add_pi32(sum6, *(__m64 *) &interleaved[i + 12]);
            sum7 = _mm_add_pi32(sum7, *(__m64 *) &interleaved[i + 14]);
        }
    
        // reset the MMX/floating-point state
        _mm_empty();
    
        // write out the sums; we have to do something with the sums so that
        // the optimizer doesn't just decide to avoid computing them.
        printf("%.8x %.8x\n", ((int *) &sum0)[0], ((int *) &sum0)[1]);
        printf("%.8x %.8x\n", ((int *) &sum1)[0], ((int *) &sum1)[1]);
        printf("%.8x %.8x\n", ((int *) &sum2)[0], ((int *) &sum2)[1]);
        printf("%.8x %.8x\n", ((int *) &sum3)[0], ((int *) &sum3)[1]);
        printf("%.8x %.8x\n", ((int *) &sum4)[0], ((int *) &sum4)[1]);
        printf("%.8x %.8x\n", ((int *) &sum5)[0], ((int *) &sum5)[1]);
        printf("%.8x %.8x\n", ((int *) &sum6)[0], ((int *) &sum6)[1]);
        printf("%.8x %.8x\n", ((int *) &sum7)[0], ((int *) &sum7)[1]);
    }
    
    int main()
    {
        int count        = 10000;
        int *interleaved = new int[16 * count];
    
        // create some random numbers to add up
        // (note that on VS2010, RAND_MAX is just 32767)
        for (int i = 0; i < 16 * count; ++i) {
            interleaved[i] = rand();
        }
    
        addWithInterleavedIntrinsics(interleaved, count);
    
        // release the input buffer
        delete[] interleaved;
        return 0;
    }
    

    Here's the generated assembly code for the inner portion of the sum loop (without its prolog and epilog). Note how most sums are kept efficiently in mm1-mm6. Contrast that with mm0, which is used to load the value to add to each sum, and with mm7, which serves the last two sums and has to be reloaded from and stored back to the stack ([esp+10h] and [esp+18h]) on every iteration. The 7-sum version of this program doesn't seem to have the mm7 problem.

    012D1070  movq        mm7,mmword ptr [esp+18h]  
    012D1075  movq        mm0,mmword ptr [eax-10h]  
    012D1079  paddd       mm1,mm0  
    012D107C  movq        mm0,mmword ptr [eax-8]  
    012D1080  paddd       mm2,mm0  
    012D1083  movq        mm0,mmword ptr [eax]  
    012D1086  paddd       mm3,mm0  
    012D1089  movq        mm0,mmword ptr [eax+8]  
    012D108D  paddd       mm4,mm0  
    012D1090  movq        mm0,mmword ptr [eax+10h]  
    012D1094  paddd       mm5,mm0  
    012D1097  movq        mm0,mmword ptr [eax+18h]  
    012D109B  paddd       mm6,mm0  
    012D109E  movq        mm0,mmword ptr [eax+20h]  
    012D10A2  paddd       mm7,mm0  
    012D10A5  movq        mmword ptr [esp+18h],mm7  
    012D10AA  movq        mm0,mmword ptr [esp+10h]  
    012D10AF  movq        mm7,mmword ptr [eax+28h]  
    012D10B3  add         eax,40h  
    012D10B6  dec         ecx  
    012D10B7  paddd       mm0,mm7  
    012D10BA  movq        mmword ptr [esp+10h],mm0  
    012D10BF  jne         main+70h (12D1070h)  
    

    So what can you do?


    1. Profile the 7-sum and 8-sum intrinsic-based programs. Choose the one that executes quicker.

    2. Profile a version that accumulates into just one MMX register. It should still be able to take advantage of the fact that modern processors fetch 64 to 128 bytes into the cache at a time. It is not obvious that the 8-sum version would be faster than the 1-sum one: the 1-sum version fetches exactly the same amount of memory and performs exactly the same number of MMX additions. You will need to interleave the inputs accordingly, though.

    3. If your target hardware allows it, consider using SSE2 instructions, which can add four 32-bit values at a time. The 128-bit integer add belongs to SSE2, which Intel CPUs have supported since the Pentium 4; the original SSE, available since the Pentium III in 1999, only handles packed single-precision floats in the 128-bit registers.
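
As a rough illustration of the last suggestion, the sketch below uses 128-bit integer additions through `_mm_add_epi32`, which is an SSE2 intrinsic declared in `<emmintrin.h>`. The function name `addWithInterleavedSse2` and the `sums` output parameter are hypothetical and not part of the original answer. It assumes an x86 target with SSE2 enabled (the default on x86-64; 32-bit builds may need a compiler switch) and has not been profiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_add_epi32, _mm_loadu_si128, ...

// Hypothetical SSE2 counterpart of addWithInterleavedIntrinsics: four XMM
// accumulators hold the 16 interleaved sums, four int32 lanes each.
// `count` is the number of 16-int groups; `sums` receives the 16 totals.
void addWithInterleavedSse2(const std::int32_t *interleaved, std::size_t count,
                            std::int32_t sums[16])
{
    assert(interleaved != nullptr || count == 0);

    __m128i acc0 = _mm_setzero_si128(), acc1 = _mm_setzero_si128(),
            acc2 = _mm_setzero_si128(), acc3 = _mm_setzero_si128();

    for (std::size_t i = 0; i < 16 * count; i += 16) {
        // Unaligned loads, so the input needs no special alignment.
        const __m128i *p = reinterpret_cast<const __m128i *>(&interleaved[i]);
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128(p + 0));  // sums 0..3
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128(p + 1));  // sums 4..7
        acc2 = _mm_add_epi32(acc2, _mm_loadu_si128(p + 2));  // sums 8..11
        acc3 = _mm_add_epi32(acc3, _mm_loadu_si128(p + 3));  // sums 12..15
    }

    // Write the accumulators out in lane order.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&sums[0]),  acc0);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&sums[4]),  acc1);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&sums[8]),  acc2);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&sums[12]), acc3);

    // No _mm_empty() call is needed here: XMM registers do not alias the
    // x87 floating-point registers the way MMX registers do.
}
```

Only four accumulators are live, so register pressure should be much lower than in the 8-register MMX version; whether it is actually faster still has to be measured.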
