Efficient 4x4 matrix multiplication (C vs assembly)

Question

I'm looking for a faster and trickier way to multiply two 4x4 matrices in C. My current research is focused on x86-64 assembly with SIMD extensions. So far, I've created a function which is about 6x faster than a naive C implementation, which has exceeded my expectations for the performance improvement. Unfortunately, this stays true only when no optimization flags are used for compilation (GCC 4.7). With -O2, C becomes faster and my effort becomes meaningless.

I know that modern compilers make use of complex optimization techniques to achieve almost perfect code, usually faster than an ingenious piece of hand-crafted assembly. But in a minority of performance-critical cases, a human may try to fight the compiler for clock cycles. Especially when some mathematics backed by a modern ISA can be exploited (as it is in my case).

My function looks as follows (AT&T syntax, GNU Assembler):

    .text
    .globl matrixMultiplyASM
    .type matrixMultiplyASM, @function
matrixMultiplyASM:
    movaps   (%rdi), %xmm0    # fetch the first matrix (use four registers)
    movaps 16(%rdi), %xmm1
    movaps 32(%rdi), %xmm2
    movaps 48(%rdi), %xmm3
    xorq %rcx, %rcx           # reset (forward) loop iterator
.ROW:
    movss (%rsi), %xmm4       # Compute four values (one row) in parallel:
    shufps $0x0, %xmm4, %xmm4 # 4x 4FP mul's, 3x 4FP add's 6x mov's per row,
    mulps %xmm0, %xmm4        # expressed in four sequences of 5 instructions,
    movaps %xmm4, %xmm5       # executed 4 times for 1 matrix multiplication.
    addq $0x4, %rsi

    movss (%rsi), %xmm4       # movss + shufps comprise _mm_set1_ps intrinsic
    shufps $0x0, %xmm4, %xmm4 #
    mulps %xmm1, %xmm4
    addps %xmm4, %xmm5
    addq $0x4, %rsi           # manual pointer arithmetic simplifies addressing

    movss (%rsi), %xmm4
    shufps $0x0, %xmm4, %xmm4
    mulps %xmm2, %xmm4        # actual computation happens here
    addps %xmm4, %xmm5        #
    addq $0x4, %rsi

    movss (%rsi), %xmm4       # one mulps operand fetched per sequence
    shufps $0x0, %xmm4, %xmm4 #  |
    mulps %xmm3, %xmm4        # the other is already waiting in %xmm[0-3]
    addps %xmm4, %xmm5
    addq $0x4, %rsi           # 5 preceding comments stride among the 4 blocks

    movaps %xmm5, (%rdx,%rcx) # store the resulting row, actually, a column
    addq $0x10, %rcx          # (matrices are stored in column-major order)
    cmpq $0x40, %rcx
    jne .ROW
    ret
.size matrixMultiplyASM, .-matrixMultiplyASM

It calculates a whole column of the resultant matrix per iteration, by processing four floats packed in 128-bit SSE registers. The full vectorisation is possible with a bit of math (operation reordering and aggregation) and mulps/addps instructions for parallel multiplication/addition of 4-float packs. The code reuses the registers meant for passing parameters (%rdi, %rsi, %rdx: GNU/Linux ABI), benefits from (inner) loop unrolling and holds one matrix entirely in XMM registers to reduce memory reads. As you can see, I have researched the topic and took my time to implement it the best I can.
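The movss + shufps pair commented in the listing maps directly onto the _mm_set1_ps intrinsic. As a sketch (the helper name and stand-alone column function are mine, not code from the question), one iteration of the .ROW loop could be expressed in C like this:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sketch: one iteration of the .ROW loop above, i.e. one output column.
 * a points at the whole left matrix (column-major, 16-byte aligned),
 * bcol at one column of the right matrix, rcol at the output column. */
static void mul_one_column(const float *a, const float *bcol, float *rcol) {
    __m128 a0 = _mm_load_ps(a +  0);            /* like %xmm0..%xmm3 */
    __m128 a1 = _mm_load_ps(a +  4);
    __m128 a2 = _mm_load_ps(a +  8);
    __m128 a3 = _mm_load_ps(a + 12);
    /* _mm_set1_ps(x) is the movss + shufps broadcast from the listing */
    __m128 acc = _mm_mul_ps(a0, _mm_set1_ps(bcol[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(a1, _mm_set1_ps(bcol[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(a2, _mm_set1_ps(bcol[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(a3, _mm_set1_ps(bcol[3])));
    _mm_store_ps(rcol, acc);                    /* movaps to (%rdx,%rcx) */
}
```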

The naive C calculation conquering my code looks like this:

void matrixMultiplyNormal(mat4_t *mat_a, mat4_t *mat_b, mat4_t *mat_r) {
    for (unsigned int i = 0; i < 16; i += 4)
        for (unsigned int j = 0; j < 4; ++j)
            mat_r->m[i + j] = (mat_b->m[i + 0] * mat_a->m[j +  0])
                            + (mat_b->m[i + 1] * mat_a->m[j +  4])
                            + (mat_b->m[i + 2] * mat_a->m[j +  8])
                            + (mat_b->m[i + 3] * mat_a->m[j + 12]);
}
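The question does not show the definition of mat4_t; given the column-major storage mentioned above, a minimal definition it could plausibly use would be (the struct layout is an assumption):

```c
/* Assumed definition of mat4_t: 16 floats in column-major order, so
 * element (row, col) lives at m[col * 4 + row]. The question does not
 * show this type; this is only a plausible reconstruction. */
typedef struct {
    float m[16];
} mat4_t;
```

With this layout, the i loop in matrixMultiplyNormal walks output columns (i/4) and the j loop walks output rows, matching the column-major convention used by the assembly.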

I have investigated the optimised assembly output of the above C code which, while storing floats in XMM registers, does not involve any parallel operations – just scalar calculations, pointer arithmetic and conditional jumps. The compiler's code seems to be less deliberate, but it is still slightly more effective than my vectorised version, which I expected to be about 4x faster. I'm sure that the general idea is correct – programmers do similar things with rewarding results. But what is wrong here? Are there any register allocation or instruction scheduling issues I am not aware of? Do you know of any x86-64 assembly tools or tricks to support my battle against the machine?

Answer

There is a way to accelerate the code and outplay the compiler. It does not involve any sophisticated pipeline analysis or deep code micro-optimisation (which doesn't mean that it couldn't further benefit from these). The optimisation uses three simple tricks:

  1. The function is now 32-byte aligned (which significantly boosted performance),

  2. The main loop goes inversely, which reduces the comparison to a zero test (based on EFLAGS),

  3. Instruction-level address arithmetic proved to be faster than the "external" pointer calculation (even though it requires twice as many additions, in 3/4 of the cases). It shortened the loop body by four instructions and reduced data dependencies within its execution path. See the related question.

Additionally, the code uses a relative jump syntax, which suppresses a symbol redefinition error that occurs when GCC tries to inline it (after being placed within an asm statement and compiled with -O3).

    .text
    .align 32                           # 1. function entry alignment
    .globl matrixMultiplyASM            #    (for a faster call)
    .type matrixMultiplyASM, @function
matrixMultiplyASM:
    movaps   (%rdi), %xmm0
    movaps 16(%rdi), %xmm1
    movaps 32(%rdi), %xmm2
    movaps 48(%rdi), %xmm3
    movq $48, %rcx                      # 2. loop reversal
1:                                      #    (for simpler exit condition)
    movss (%rsi, %rcx), %xmm4           # 3. extended address operands
    shufps $0, %xmm4, %xmm4             #    (faster than pointer calculation)
    mulps %xmm0, %xmm4
    movaps %xmm4, %xmm5
    movss 4(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm1, %xmm4
    addps %xmm4, %xmm5
    movss 8(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm2, %xmm4
    addps %xmm4, %xmm5
    movss 12(%rsi, %rcx), %xmm4
    shufps $0, %xmm4, %xmm4
    mulps %xmm3, %xmm4
    addps %xmm4, %xmm5
    movaps %xmm5, (%rdx, %rcx)
    subq $16, %rcx                      # one 'sub' (vs 'add' & 'cmp')
    jge 1b                              # SF=OF, idiom: jump if non-negative
    ret

This is the fastest x86-64 implementation I have seen so far. I will appreciate, vote up and accept any answer providing a faster piece of assembly for that purpose!
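For comparison, the same algorithm — including the reversed loop of trick 2 — can be sketched with SSE intrinsics, leaving register allocation and instruction scheduling to the compiler. The function name is mine and this is an illustration under the stated assumptions, not code taken from the answer:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sketch of the optimised routine in intrinsics form. All three pointers
 * are assumed 16-byte aligned (movaps would fault otherwise), matrices
 * are column-major, and the column loop counts down as in trick 2. */
void matrixMultiplySSE(const float *a, const float *b, float *r) {
    __m128 a0 = _mm_load_ps(a +  0);    /* left matrix held in registers */
    __m128 a1 = _mm_load_ps(a +  4);
    __m128 a2 = _mm_load_ps(a +  8);
    __m128 a3 = _mm_load_ps(a + 12);
    for (int col = 12; col >= 0; col -= 4) {   /* reversed loop */
        __m128 acc = _mm_mul_ps(a0, _mm_set1_ps(b[col + 0]));
        acc = _mm_add_ps(acc, _mm_mul_ps(a1, _mm_set1_ps(b[col + 1])));
        acc = _mm_add_ps(acc, _mm_mul_ps(a2, _mm_set1_ps(b[col + 2])));
        acc = _mm_add_ps(acc, _mm_mul_ps(a3, _mm_set1_ps(b[col + 3])));
        _mm_store_ps(r + col, acc);            /* one output column */
    }
}
```

Whether this beats the hand-written assembly depends on the compiler and flags, but it avoids the inlining and symbol-redefinition issues noted above entirely.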
