Efficient 4x4 matrix multiplication (C vs assembly)


Problem description


I'm looking for a faster and trickier way to multiply two 4x4 matrices in C. My current research is focused on x86-64 assembly with SIMD extensions. So far, I've created a function which is about 6x faster than a naive C implementation, exceeding my expectations for the performance improvement. Unfortunately, this stays true only when no optimization flags are used for compilation (GCC 4.7). With -O2, the C version becomes faster and my effort becomes meaningless.


I know that modern compilers use complex optimization techniques to achieve almost perfect code, usually faster than an ingenious piece of hand-crafted assembly. But in a minority of performance-critical cases, a human may try to fight the compiler for clock cycles. Especially when some mathematics backed by a modern ISA can be exploited (as in my case).


My function looks as follows (AT&T syntax, GNU Assembler):

    .text
    .globl matrixMultiplyASM
    .type matrixMultiplyASM, @function
matrixMultiplyASM:
    movaps   (%rdi), %xmm0    # fetch the first matrix (use four registers)
    movaps 16(%rdi), %xmm1
    movaps 32(%rdi), %xmm2
    movaps 48(%rdi), %xmm3
    xorq %rcx, %rcx           # reset (forward) loop iterator
.ROW:
    movss (%rsi), %xmm4       # Compute four values (one row) in parallel:
    shufps $0x0, %xmm4, %xmm4 # 4x 4FP mul's, 3x 4FP add's 6x mov's per row,
    mulps %xmm0, %xmm4        # expressed in four sequences of 5 instructions,
    movaps %xmm4, %xmm5       # executed 4 times for 1 matrix multiplication.
    addq $0x4, %rsi

    movss (%rsi), %xmm4       # movss + shufps comprise _mm_set1_ps intrinsic
    shufps $0x0, %xmm4, %xmm4 #
    mulps %xmm1, %xmm4
    addps %xmm4, %xmm5
    addq $0x4, %rsi           # manual pointer arithmetic simplifies addressing

    movss (%rsi), %xmm4
    shufps $0x0, %xmm4, %xmm4
    mulps %xmm2, %xmm4        # actual computation happens here
    addps %xmm4, %xmm5        #
    addq $0x4, %rsi

    movss (%rsi), %xmm4       # one mulps operand fetched per sequence
    shufps $0x0, %xmm4, %xmm4 #  |
    mulps %xmm3, %xmm4        # the other is already waiting in %xmm[0-3]
    addps %xmm4, %xmm5
    addq $0x4, %rsi           # 5 preceding comments stride among the 4 blocks

    movaps %xmm5, (%rdx,%rcx) # store the resulting row, actually, a column
    addq $0x10, %rcx          # (matrices are stored in column-major order)
    cmpq $0x40, %rcx
    jne .ROW
    ret
.size matrixMultiplyASM, .-matrixMultiplyASM


It calculates a whole column of the resultant matrix per iteration, by processing four floats packed in 128-bit SSE registers. The full vectorisation is possible with a bit of math (operation reordering and aggregation) and mulps/addps instructions for parallel multiplication/addition of 4-float packages. The code reuses the registers meant for passing parameters (%rdi, %rsi, %rdx: the GNU/Linux ABI), benefits from (inner) loop unrolling and holds one matrix entirely in XMM registers to reduce memory reads. As you can see, I have researched the topic and taken my time to implement it as best I can.
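The same column-per-iteration scheme (movss + shufps broadcast, then mulps/addps accumulation) maps directly onto SSE intrinsics. Below is a minimal sketch of that mapping; the function name matrixMultiplySSE is my own, and the column-major float[16] layout matches the question:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* r = a * b for column-major 4x4 float matrices (16-byte aligned).
   Each iteration builds one column of r, mirroring the assembly:
   broadcast one element of b's column, multiply by a column of a,
   and accumulate. */
static void matrixMultiplySSE(const float *a, const float *b, float *r) {
    __m128 a0 = _mm_load_ps(a +  0);   /* columns of a, like %xmm0-%xmm3 */
    __m128 a1 = _mm_load_ps(a +  4);
    __m128 a2 = _mm_load_ps(a +  8);
    __m128 a3 = _mm_load_ps(a + 12);
    for (int i = 0; i < 4; ++i) {
        /* _mm_set1_ps is the movss + shufps pair from the listing */
        __m128 col = _mm_mul_ps(a0, _mm_set1_ps(b[4 * i + 0]));
        col = _mm_add_ps(col, _mm_mul_ps(a1, _mm_set1_ps(b[4 * i + 1])));
        col = _mm_add_ps(col, _mm_mul_ps(a2, _mm_set1_ps(b[4 * i + 2])));
        col = _mm_add_ps(col, _mm_mul_ps(a3, _mm_set1_ps(b[4 * i + 3])));
        _mm_store_ps(r + 4 * i, col);   /* store the resulting column */
    }
}
```

Unlike the hand-written listing, this form lets the compiler schedule the instructions and allocate registers itself, which often closes the gap against -O2 output.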


The naive C calculation that beats my code looks like this:

void matrixMultiplyNormal(mat4_t *mat_a, mat4_t *mat_b, mat4_t *mat_r) {
    for (unsigned int i = 0; i < 16; i += 4)
        for (unsigned int j = 0; j < 4; ++j)
            mat_r->m[i + j] = (mat_b->m[i + 0] * mat_a->m[j +  0])
                            + (mat_b->m[i + 1] * mat_a->m[j +  4])
                            + (mat_b->m[i + 2] * mat_a->m[j +  8])
                            + (mat_b->m[i + 3] * mat_a->m[j + 12]);
}


I have investigated the optimised assembly output of the above C code which, while storing floats in XMM registers, does not involve any parallel operations – just scalar calculations, pointer arithmetic and conditional jumps. The compiler's code seems less deliberate, yet it is still slightly more effective than my vectorised version, which I had expected to be about 4x faster. I'm sure that the general idea is correct – programmers do similar things with rewarding results. But what is wrong here? Are there any register allocation or instruction scheduling issues I am not aware of? Do you know any x86-64 assembly tools or tricks to support my battle against the machine?
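For comparisons like this, a repeatable harness matters as much as the kernel itself. A minimal sketch of one (my own, not from the question; it reuses the naive function above and assumes POSIX clock_gettime):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

typedef struct { float m[16]; } mat4_t;

/* The naive implementation from the question, unchanged. */
static void matrixMultiplyNormal(mat4_t *mat_a, mat4_t *mat_b, mat4_t *mat_r) {
    for (unsigned int i = 0; i < 16; i += 4)
        for (unsigned int j = 0; j < 4; ++j)
            mat_r->m[i + j] = (mat_b->m[i + 0] * mat_a->m[j +  0])
                            + (mat_b->m[i + 1] * mat_a->m[j +  4])
                            + (mat_b->m[i + 2] * mat_a->m[j +  8])
                            + (mat_b->m[i + 3] * mat_a->m[j + 12]);
}

/* Runs `iters` multiplications through a function pointer (so the call
   is not optimised away) and returns nanoseconds per multiplication. */
static double bench(void (*mul)(mat4_t *, mat4_t *, mat4_t *), long iters) {
    mat4_t a = {{1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1}};
    mat4_t b = {{1,2,3,4, 5,6,7,8, 9,10,11,12, 13,14,15,16}};
    mat4_t r;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; ++i)
        mul(&a, &b, &r);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9
            + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

Calling, say, bench(matrixMultiplyNormal, 1000000) and the equivalent for the assembly version, with the same compiler flags, gives comparable ns-per-call figures for both.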

Solution


There is a way to accelerate the code and outplay the compiler. It does not involve any sophisticated pipeline analysis or deep code micro-optimisation (which doesn't mean that it couldn't further benefit from these). The optimisation uses three simple tricks:


  1. The function is now 32-byte aligned (which significantly boosted performance),


  2. The main loop goes in reverse, which reduces the comparison to a test against zero (based on EFLAGS),


  3. Instruction-level address arithmetic proved to be faster than the "external" pointer calculation (even though it requires twice as many additions «in 3/4 cases»). It shortened the loop body by four instructions and reduced data dependencies within its execution path. See the related question: http://stackoverflow.com/questions/18552677/how-does-address-operand-affect-performance-and-size-of-machine-code


Additionally, the code uses a relative-jump (local label) syntax which suppresses the symbol redefinition error that occurs when GCC tries to inline it (i.e. when it is placed within an asm statement and compiled with -O3).

    .text
    .align 32                           # 1. function entry alignment
    .globl matrixMultiplyASM            #    (for a faster call)
    .type matrixMultiplyASM, @function
matrixMultiplyASM:
    movaps   (%rdi), %xmm0
    movaps 16(%rdi), %xmm1
    movaps 32(%rdi), %xmm2
    movaps 48(%rdi), %xmm3
    movq $48, %rcx                      # 2. loop reversal
1:                                      #    (for simpler exit condition)
    movss (%rsi, %rcx), %xmm4           # 3. extended address operands
    shufps $0, %xmm4, %xmm4             #    (faster than pointer calculation)
    mulps %xmm0, %xmm4
    movaps %xmm4, %xmm5
    movss 4(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm1, %xmm4
    addps %xmm4, %xmm5
    movss 8(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm2, %xmm4
    addps %xmm4, %xmm5
    movss 12(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm3, %xmm4
    addps %xmm4, %xmm5
    movaps %xmm5, (%rdx, %rcx)
    subq $16, %rcx                      # one 'sub' (vs 'add' & 'cmp')
    jge 1b                              # SF=OF, idiom: jump if positive
    ret
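The local-label trick (1: / 1b) used above can be shown in isolation. The sketch below is my own toy example, not part of the answer; it assumes an x86-64 target. A numeric label may appear many times in one assembly file, and "1b" branches to the nearest preceding "1:", so GCC can duplicate the block at several inline sites without a "symbol already defined" error – which a named label like .ROW would trigger:

```c
/* Hypothetical helper: sums 1..n (n >= 0) in a GCC extended asm loop.
   The "1:" local label survives duplication when the function is
   inlined at multiple call sites, unlike a named label. */
static inline long sum_to(long n) {
    long acc = 0;
    __asm__ volatile(
        "1:\n\t"                    /* local label: redefinable */
        "addq %[i], %[acc]\n\t"     /* acc += i */
        "decq %[i]\n\t"             /* i-- (sets flags) */
        "jg 1b\n\t"                 /* loop back while i > 0 */
        : [acc] "+r"(acc), [i] "+r"(n)
        :
        : "cc");
    return acc;
}
```

The same renaming turns .ROW in the original listing into an inline-safe body.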


This is the fastest x86-64 implementation I have seen so far. I will appreciate, vote up and accept any answer providing a faster piece of assembly for that purpose!
