GCC优化了基于固定范围的循环,就好像它具有更长的可变长度 [英] GCC optimizes fixed range-based for loop as if it had longer, variable length

查看:152
本文介绍了GCC优化了基于固定范围的循环,就好像它具有更长的可变长度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组POD结构,并试图在一个字段中求和。这是一个简单的例子:

  struct Item 
{
int x = 0;
int y = 0;
};

typedef商品项目[2];

struct ItemArray
{
Items items;

int sum_x1()const;
int sum_x2()const;
};

int ItemArray :: sum_x1()const
{
int total = 0;
for(unsigned ii = 0; ii <2; ++ ii)
{
total + = items [ii] .x;
}
总回报;
}

int ItemArray :: sum_x2()const
{
int total = 0;
for(const Item& item:items)
{
total + = item.x;
}
总回报;
}

两个和函数做同样的事情。铿锵将它们编成一致。但是x86_64上的 -O3 的GCC 6没有。这是 sum_x1(),看起来不错:

  mov eax,DWORD PTR [rdi + 8] 
add eax,DWORD PTR [rdi]
ret

现在看看 sum_x2()

  lea rdx,[rdi +16] 
lea rcx,[rdi + 8]
xor eax,eax
add eax,DWORD PTR [rdi]
cmp rdx,rcx
je .L12
lea rcx,[rdi + 16]
add eax,DWORD PTR [rdi + 8]
cmp rdx,rcx
je .L2
lea rcx,[rdi +24]
add eax,DWORD PTR [rdi + 16]
cmp rdx,rcx
je .L2
lea rcx,[rdi + 32]
add eax ,DWORD PTR [rdi + 24]
cmp rdx,rcx
je .L2
lea rcx,[rdi + 40]
add eax,DWORD PTR [rdi + 32]
cmp rdx,rcx
je .L2
lea rcx,[rdi + 48]
add eax,DWORD PTR [rdi + 40]
cmp rdx,rcx
je .L2
lea rcx,[rdi + 56]
add eax,DWORD PTR [rdi + 48]
cmp rdx,rcx
je .L2
lea rcx,[rdi + 64]
add eax,DWORD PTR [rdi + 56]
cmp rdx,rcx
je .L2
lea rcx,[rdi + 72]
add eax,DWORD PTR [rdi + 64]
cmp rdx,rcx
je .L2
add eax,DWORD PTR [rdi + 72]
ret
.L2:
rep ret
.L12:
rep ret

当循环长度固定为2时,为什么GCC会发出长度可达10的展开循环?它只在一个成员函数中做到这一点 - 使得 sum_x2 一个免费函数修复它。



ICC还优化 sum_x2()非常奇怪,尽管生成的代码完全不同。与GCC不同的是, sum_x2()是一个成员函数还是一个自由函数 - 都不好。



我正在使用GCC 6,但所有版本的GCC似乎都遇到了此代码的问题。添加 -march = haswell 使其更糟糕,在大小为2的数组中添加多达15个元素的迭代.GCC 5和7生成更复杂的代码,添加SIMD指令。



我想确定此问题的确切原因,以便我可以在代码中找到并修复类似事件。了解什么在GCC 6中触发这种行为将会非常有帮助。我的代码中有很多基于范围的for循环,我并不兴奋去除它们,但如果GCC无法生成合理的代码,我将别无选择。



试试看: https://godbolt.org/g/9GK4jy



更多相关的精神错乱: https://godbolt.org/g/BGYggD (最佳代码是3条指令; GCC 6产生8条指令; GCC 7产生130条指令)解决方案

Richard Biener在我的错误报告中,似乎是这个问题版本8之前的GCC未能理解类或结构的字段作为常规变量经受相同的优化(例如,恒定循环计数)。因此,如果容器是一个成员变量,它会发出各种各样的幻想代码,以最佳的方式循环未知次数,即使它在编译时已知。



根据我的理解,这个bug可能会影响到相当多的代码 - 例如在任何地方,一个成员小数组是基于C ++ 11范围for循环的主题。



感谢Richard Biener提供的解决方案(针对GCC 8)。

I have an array of POD structs and am trying to sum across one field. Here's a minimal example:

struct Item
{
    int x = 0;
    int y = 0;
};

typedef Item Items[2];

struct ItemArray
{
    Items items;

    int sum_x1() const;
    int sum_x2() const;
};

int ItemArray::sum_x1() const
{
    int total = 0;
    for (unsigned ii = 0; ii < 2; ++ii)
    {
        total += items[ii].x;
    }
    return total;
}

int ItemArray::sum_x2() const
{
    int total = 0;
    for (const Item& item : items)
    {
        total += item.x;
    }
    return total;
}

The two sum functions do the same thing. Clang compiles them identically. But GCC 6 with -O3 on x86_64 does not. Here's sum_x1(), looking good:

  mov eax, DWORD PTR [rdi+8]
  add eax, DWORD PTR [rdi]
  ret

Now look at sum_x2():

  lea rdx, [rdi+16]
  lea rcx, [rdi+8]
  xor eax, eax
  add eax, DWORD PTR [rdi]
  cmp rdx, rcx
  je .L12
  lea rcx, [rdi+16]
  add eax, DWORD PTR [rdi+8]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+24]
  add eax, DWORD PTR [rdi+16]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+32]
  add eax, DWORD PTR [rdi+24]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+40]
  add eax, DWORD PTR [rdi+32]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+48]
  add eax, DWORD PTR [rdi+40]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+56]
  add eax, DWORD PTR [rdi+48]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+64]
  add eax, DWORD PTR [rdi+56]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+72]
  add eax, DWORD PTR [rdi+64]
  cmp rdx, rcx
  je .L2
  add eax, DWORD PTR [rdi+72]
  ret
.L2:
  rep ret
.L12:
  rep ret

Why does GCC emit an unrolled loop of variable length up to 10 when there are the loop length is fixed at 2? It only does this in a member function--making sum_x2 a free function fixes it.

ICC also optimizes sum_x2() very strangely, though the generated code is totally different. Unlike GCC, it doesn't matter whether sum_x2() is a member function or a free function--both are bad.

I'm using GCC 6, but all versions of GCC seem to have problems with this code. Adding -march=haswell makes it even worse, adding iterations for up to 15 elements in the array of size 2. GCC 5 and 7 generate even more complex code, adding SIMD instructions.

I would like to identify the exact cause of this problem, so that I can locate and fix similar occurrences in my code. Understanding what triggers this behavior in GCC 6 would be very helpful. I have lots of range-based for loops in my code, and I am not too excited at the prospect of removing them, but if GCC can't generate reasonable code I will have no choice.

Try it: https://godbolt.org/g/9GK4jy

More related insanity: https://godbolt.org/g/BGYggD (optimal code is 3 instructions; GCC 6 produces 8 instructions; GCC 7 produces 130 instructions)

解决方案

As described by Richard Biener in my bug report, the problem seems to be that GCC prior to version 8 failed to understand that a field of a class or struct was subject to the same optimizations (e.g. constant loop count) as a regular variable. So it would emit all sorts of fancy code to optimally loop an unknown number of times, even when it was known at compile time, in the case where the container was a member variable.

The way I understand it, this bug probably affects quite a bit of code in the wild--e.g. anywhere a member small array is the subject of a C++11 range-based for loop.

Thanks to Richard Biener for the prompt resolution (targeted for GCC 8).

这篇关于GCC优化了基于固定范围的循环,就好像它具有更长的可变长度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆