循环展开优化,这是如何工作的 [英] Loop unrolling optimization, how does this work

查看:160
本文介绍了循环展开优化,这是如何工作的的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下C代码:

int sum=0;
for(int i=0;i<5;i++)
    sum+=i;

这可以这样(在不展开循环的情况下)在(伪)程序集中翻译:

This could be translated in (pseudo-) assembly this way (without loop unrolling):

% pseudo-code assembly
ADDI $R10, #0   % sum
ADDI $R11, #0   % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
BNE $R11, #5 LOOP

所以我的第一个问题是如何在以下两种方式之间使用循环展开来翻译此代码:

So my first question is how is this code translated using loop unrolling, between these two ways:

1)

ADDI $R10, #0
ADDI $R10, #0
ADDI $R10, #1
ADDI $R10, #2
ADDI $R10, #3
ADDI $R10, #4

2)

   ADD $R10, #10

编译器是否能够优化代码并直接知道它必须在不执行所有求和的情况下加10?

Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?

还有,是否有可能用分支指令阻塞管道?我必须这样写吗?

Also, is there a possibility to block the pipeline with a branch instruction? Do I have to write it this way:

% pseudo-code assembly
ADDI $R10, #0   % sum
ADDI $R11, #0   % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
NOP   % is this necessary to avoid the pipeline blocking?
NOP
NOP
NOP
BNE $R11, #5 LOOP

为避免fetch-decode-exe-mem-write回写周期被分支中断?

To avoid that the fetch-decode-exe-mem-write back cycle is interrupted by the branch?

推荐答案

这更多是为了演示编译器的功能,而不是每个编译器的功能.来源:

This is more for demonstration of what a compiler is capable of, rather than what every compiler would do. The source:

#include <stdio.h>

int main(void)
{
    int i, sum = 0;

    for(i=0; i<5; i++) {
        sum+=i;
    }

    printf("%d\n", sum);
    return 0;
}

请注意我添加的printf.如果未使用该变量,则编译器将优化整个循环.

Note the printf I have added. If the variable is not used, the compiler will optimize out the entire loop.

使用-O0编译(无优化)

gcc -Wall -O0 -S -c lala.c:

.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3

循环以哑"方式发生,其中-8(%rbp)是变量i.

The loop happens in a 'dumb' way, with -8(%rbp) being the variable i.

使用-O1(优化级别1)进行编译

gcc -Wall -O1 -S -c lala.c:

movl    $10, %edx

该循环已被完全删除,并替换为等效值.

The loop has been completely removed and replaced with the equivalent value.

在展开时,编译器将查看将发生多少次迭代,并尝试通过执行较少的迭代来展开.例如,循环体可能会重复两次,这将导致分支数量减半.在C中是这样的情况:

In unrolling, the compiler looks to see how many iterations would happen and tries to unroll by performing less iterations. For example, the loop body might be duplicated twice which would result in the number of branches to be halved. Such a case in C:

int i = 0, sum = 0;

sum += i;
i++;

for(; i<5;i++) {
    sum+=i;
    i++;
    sum+=i;
}

请注意,必须从循环中提取一次迭代.这是因为5是奇数,因此不能简单地通过复制内容将工作减半.在这种情况下,循环将仅输入两次. -O0产生的汇编代码:

Notice that one iteration had to be extracted out of the loop. This is because 5 is an odd number and so the work can not simply be halved by duplicating the contents. In this case the loop will only be entered twice. The assembly code produced by -O0:

    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L2
.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)

完全在C中展开:

for(i=0; i<5;i++) {
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
}

这一次实际上只输入了一次循环.用-O0:

This time the loop is actually entered only once. The assembly produced with -O0:

.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3

这篇关于循环展开优化,这是如何工作的的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆