AVX2 support in GCC 5 and later


Problem description

I wrote the following class "T" to accelerate manipulations of "sets of characters" using AVX2. Then I found that it doesn't work in gcc 5 and later when I use "-O3". Can anyone help me trace this down to some programming construct that is known not to work on the latest compilers/systems?

How this code works: The underlying structure ("_bits") is a block of 256 bytes (aligned and allocated for AVX2), which can be accessed either as char[256] or AVX2 elements, depending on whether an element is accessed or the whole thing is used in a vector operation. Seems like it should work perfectly well on the AVX2 platform. No?

This is really hard to debug, because "valgrind" says it's clean, and I can't use a debugger (due to the problem disappearing when I remove "-O3"). But I am not happy with just going with the "|=" workaround because if this code is really wrong, then I'm probably making the same mistake in other places and screwing up everything I develop!

It is interesting to note that the "|" operator has the problem but the "|=" does not. Could the problem be related to returning a struct from a function? But I thought that returning a struct has worked since 1990 or something.

// g++ -std=c++11 -mavx2 -O3 gcc_fail.cpp

#include "assert.h"
#include "immintrin.h" // AVX

class T {
public:
  __m256i _bits[8];
  inline bool& operator[](unsigned char c)       {return ((bool*)_bits)[c];}
  inline bool  operator[](unsigned char c) const {return ((bool*)_bits)[c];}
  inline          T()                   {}
  inline explicit T(char const*);
  inline T     operator| (T const& b) const;
  inline T &   operator|=(T const& b);
  inline bool  operator! ()           const;
};

T::T(char const* s)
{
  _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
  _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);
  char c;
  while ((c = *s++))
    (*this)[c] = true;
}

T T::operator| (T const& b) const
{
  T res;
  for (int i = 0; i < 8; i++)
    res._bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);


  // FIXME why does the above code fail with -O3 in new gcc?
  for (int i=0; i<256; i++)
    assert(res[i] == ((*this)[i] || b[i]));
  // gcc 4.7.0 - PASS
  // gcc 4.7.2 - PASS
  // gcc 4.8.0 - PASS
  // gcc 4.9.2 - PASS
  // gcc 5.2.0 - FAIL
  // gcc 5.3.0 - FAIL
  // gcc 5.3.1 - FAIL
  // gcc 6.1.0 - FAIL


  return res;
}

T & T::operator|=(T const& b)
{
  for (int i = 0; i < 8; i++)
    _bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);
  return *this;
}

bool T::operator! () const
{
  for (int i = 0; i < 8; i++)
    if (!_mm256_testz_si256(_bits[i], _bits[i]))
      return false;
  return true;
}

int Main()
{
  T sep (" ,\t\n");
  T end ("");
  return !(sep|end);
}

int main()
{
  return Main();
}

Solution

Your code's problem is the use of bool* when you should have been using unsigned char*, which allowed GCC 5 to apply a strict-aliasing (pointer alias) optimization.
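A minimal sketch of that fix, assuming the rest of the class stays as in the question: only the byte accessors change from bool to unsigned char. The Word typedef, the word_zero/word_or/word_none helpers, kWords, and the 64-bit fallback used when __AVX2__ is not defined are hypothetical additions so the sketch also builds without -mavx2. They are not part of the original code.

```cpp
// Assumed build: g++ -std=c++11 -O3 -mavx2 fixed.cpp
// Without -mavx2 a hypothetical 64-bit-word fallback is used instead.
#include <cassert>
#include <cstdint>
#ifdef __AVX2__
#include <immintrin.h>
typedef __m256i Word;                                   // one 32-byte AVX2 vector
static inline Word word_zero()             { return _mm256_setzero_si256(); }
static inline Word word_or(Word a, Word b) { return _mm256_or_si256(a, b); }
static inline bool word_none(Word a)       { return _mm256_testz_si256(a, a); }
#else
typedef std::uint64_t Word;                             // portable stand-in (assumption)
static inline Word word_zero()             { return 0; }
static inline Word word_or(Word a, Word b) { return a | b; }
static inline bool word_none(Word a)       { return a == 0; }
#endif

const int kWords = static_cast<int>(256 / sizeof(Word)); // 8 with AVX2, 32 otherwise

class T {
public:
  Word _bits[kWords];                                   // 256 membership bytes

  // unsigned char may alias any object, so these byte accesses are visible
  // to later whole-vector reads of _bits (bool accesses are not).
  unsigned char& operator[](unsigned char c) {
    return reinterpret_cast<unsigned char*>(_bits)[c];
  }
  unsigned char operator[](unsigned char c) const {
    return reinterpret_cast<unsigned char const*>(_bits)[c];
  }

  T() { for (int i = 0; i < kWords; i++) _bits[i] = word_zero(); }

  explicit T(char const* s) : T() {
    unsigned char c;
    while ((c = static_cast<unsigned char>(*s++)))
      (*this)[c] = 1;                                   // mark c as a member
  }

  T operator|(T const& b) const {
    T res;
    for (int i = 0; i < kWords; i++)
      res._bits[i] = word_or(_bits[i], b._bits[i]);
    return res;
  }

  // True when the set is empty.
  bool operator!() const {
    for (int i = 0; i < kWords; i++)
      if (!word_none(_bits[i])) return false;
    return true;
  }
};
```

Because the constructor's byte stores now go through unsigned char, the compiler can no longer assume that sep._bits still holds its initial zeros when operator| reads it. The assert loop from the question should therefore agree with the vector OR at -O3.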

The machine-code dumps of Main() produced by GCC 4.8.5 and GCC 5.3.1 are in the appendix at the end of this answer, for reference.

Looking at the code:

Decompilation

After the prologue, T sep's _bits are initialized to zero...

  _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
  _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);

  40063d:       c5 fd 7f 44 24 60               vmovdqa %ymm0,0x60(%rsp)
  400643:       c5 fd 7f 44 24 40               vmovdqa %ymm0,0x40(%rsp)
  400649:       c5 fd 7f 44 24 20               vmovdqa %ymm0,0x20(%rsp)
  40064f:       c5 fd 7f 04 24                  vmovdqa %ymm0,(%rsp)
  400654:       c5 fd 7f 84 24 e0 00 00 00      vmovdqa %ymm0,0xe0(%rsp)
  40065d:       c5 fd 7f 84 24 c0 00 00 00      vmovdqa %ymm0,0xc0(%rsp)
  400666:       c5 fd 7f 84 24 a0 00 00 00      vmovdqa %ymm0,0xa0(%rsp)
  40066f:       c5 fd 7f 84 24 80 00 00 00      vmovdqa %ymm0,0x80(%rsp)

and then written to in a loop based on char* s.

  char c;
  while ((c = *s++))
    (*this)[c] = true;

  400680:       48 83 c2 01                     add    $0x1,%rdx
  400684:       c6 04 04 01                     movb   $0x1,(%rsp,%rax,1)
  400688:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  40068c:       84 c0                           test   %al,%al
  40068e:       75 f0                           jne    400680 <_Z4Mainv+0x60>

Both compilers then initialize T end to 0s:

  400690:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400694:       31 c0                           xor    %eax,%eax
  400696:       c5 fd 7f 84 24 60 01 00 00      vmovdqa %ymm0,0x160(%rsp)
  40069f:       c5 fd 7f 84 24 40 01 00 00      vmovdqa %ymm0,0x140(%rsp)
  4006a8:       c5 fd 7f 84 24 20 01 00 00      vmovdqa %ymm0,0x120(%rsp)
  4006b1:       c5 fd 7f 84 24 00 01 00 00      vmovdqa %ymm0,0x100(%rsp)
  4006ba:       c5 fd 7f 84 24 e0 01 00 00      vmovdqa %ymm0,0x1e0(%rsp)
  4006c3:       c5 fd 7f 84 24 c0 01 00 00      vmovdqa %ymm0,0x1c0(%rsp)
  4006cc:       c5 fd 7f 84 24 a0 01 00 00      vmovdqa %ymm0,0x1a0(%rsp)
  4006d5:       c5 fd 7f 84 24 80 01 00 00      vmovdqa %ymm0,0x180(%rsp)

Both compilers then optimize out the _mm256_or_si256() operations because T end is known to be 0. But then, GCC 4.8.5 copies T sep into T res (which is what ORing anything with zero computes), while GCC 5.3.1 initializes T res to 0. It's entitled to do that because in your operator[] methods you cast a pointer of type __m256i* to bool*, and the compiler is permitted to assume that pointers to these unrelated types do not alias. In effect, GCC 5.3.1 concludes that the bool stores in the constructor never modified sep._bits, which therefore still hold the zeros written just before, and 0 | 0 is 0. Thus in GCC 4.8.5 you see

  4006de:       c5 fd 6f 04 24                  vmovdqa (%rsp),%ymm0
  4006e3:       c5 fd 7f 84 24 00 02 00 00      vmovdqa %ymm0,0x200(%rsp)
  4006ec:       c5 fd 6f 44 24 20               vmovdqa 0x20(%rsp),%ymm0
  4006f2:       c5 fd 7f 84 24 20 02 00 00      vmovdqa %ymm0,0x220(%rsp)
  4006fb:       c5 fd 6f 44 24 40               vmovdqa 0x40(%rsp),%ymm0
  400701:       c5 fd 7f 84 24 40 02 00 00      vmovdqa %ymm0,0x240(%rsp)
  40070a:       c5 fd 6f 44 24 60               vmovdqa 0x60(%rsp),%ymm0
  400710:       c5 fd 7f 84 24 60 02 00 00      vmovdqa %ymm0,0x260(%rsp)
  400719:       c5 fd 6f 84 24 80 00 00 00      vmovdqa 0x80(%rsp),%ymm0
  400722:       c5 fd 7f 84 24 80 02 00 00      vmovdqa %ymm0,0x280(%rsp)
  40072b:       c5 fd 6f 84 24 a0 00 00 00      vmovdqa 0xa0(%rsp),%ymm0
  400734:       c5 fd 7f 84 24 a0 02 00 00      vmovdqa %ymm0,0x2a0(%rsp)
  40073d:       c5 fd 6f 84 24 c0 00 00 00      vmovdqa 0xc0(%rsp),%ymm0
  400746:       c5 fd 7f 84 24 c0 02 00 00      vmovdqa %ymm0,0x2c0(%rsp)
  40074f:       c5 fd 6f 84 24 e0 00 00 00      vmovdqa 0xe0(%rsp),%ymm0
  400758:       c5 fd 7f 84 24 e0 02 00 00      vmovdqa %ymm0,0x2e0(%rsp)

while in GCC 5.3.1 you see

  4006fa:       c5 fd 7f 85 f0 fe ff ff         vmovdqa %ymm0,-0x110(%rbp)
  400702:       c5 fd 7f 85 10 ff ff ff         vmovdqa %ymm0,-0xf0(%rbp)
  40070a:       c5 fd 7f 85 30 ff ff ff         vmovdqa %ymm0,-0xd0(%rbp)
  400712:       c5 fd 7f 85 50 ff ff ff         vmovdqa %ymm0,-0xb0(%rbp)
  40071a:       c5 fd 7f 85 70 ff ff ff         vmovdqa %ymm0,-0x90(%rbp)
  400722:       c5 fd 7f 45 90                  vmovdqa %ymm0,-0x70(%rbp)
  400727:       c5 fd 7f 45 b0                  vmovdqa %ymm0,-0x50(%rbp)
  40072c:       c5 fd 7f 45 d0                  vmovdqa %ymm0,-0x30(%rbp)

The byte reads in the assert() loop then find a 0 in res wherever sep holds a 1, so the assertion fails.
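The same distinction can be reproduced outside the class. The sketch below (the function name is hypothetical) sets an int32_t and then overwrites its bytes through an unsigned char pointer into the same object. Because char types may alias anything, the compiler must reload *i after the loop, so the result is always 0. Writing the same loop with bool* would make this call undefined behaviour, which is the latitude GCC 5 used above. As a quick diagnostic, GCC's -fno-strict-aliasing option disables type-based alias analysis. If a failure disappears with it, an aliasing violation like this one is a likely suspect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sets *i to 1, then clears every byte of it through a character pointer.
// unsigned char may alias any object, so the byte stores can change *i and
// the final read must observe them. With bool* in place of unsigned char*,
// the stores would not be on the [basic.lval] list and the compiler could
// legally return 1 without re-reading memory.
inline std::int32_t store_then_clear(std::int32_t* i, unsigned char* bytes) {
  *i = 1;
  for (std::size_t k = 0; k < sizeof *i; k++)
    bytes[k] = 0;   // well-defined even when bytes points into *i
  return *i;        // must be reloaded from memory
}
```

Called as store_then_clear(&x, reinterpret_cast<unsigned char*>(&x)), the function returns 0 and leaves x == 0, regardless of optimization level or byte order.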

The Standard's Ruling on Pointer Aliasing:

ISO C++11 covers aliasing in the following section, which makes clear that an object of type __m256i cannot be accessed through a bool* (that is, through a glvalue of type bool), but may be accessed through a char* or unsigned char*:

§ 3.10 Lvalues and rvalues [basic.lval]

[...]

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: [52]

  • the dynamic type of the object,
  • a cv-qualified version of the dynamic type of the object,
  • a type similar (as defined in 4.4) to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

52) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
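To make the last bullet concrete: inspecting an object's bytes through unsigned char is always allowed. Viewing a run of bytes as some other type should instead go through std::memcpy rather than a pointer cast. Both helper names below are hypothetical, and this is a sketch rather than a complete treatment of type punning.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reading any object's bytes through unsigned char is on the list above,
// so this is well-defined.
inline unsigned byte_sum(std::uint32_t x) {
  unsigned char const* p = reinterpret_cast<unsigned char const*>(&x);
  unsigned sum = 0;
  for (std::size_t k = 0; k < sizeof x; k++)
    sum += p[k];
  return sum;
}

// To view bytes as another type, copy them: std::memcpy is defined for this,
// whereas *reinterpret_cast<std::uint32_t const*>(b) would access unsigned
// char objects through a uint32_t glvalue, which the list does not allow.
inline std::uint32_t from_bytes(unsigned char const (&b)[4]) {
  std::uint32_t v;
  std::memcpy(&v, b, sizeof v);
  return v;
}
```

For example, byte_sum(0x11223344u) is 0xAA on both little- and big-endian machines, and from_bytes applied to the bytes {1, 1, 1, 1} yields 0x01010101.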

Appendix

GCC 4.8.5:

0000000000400620 <_Z4Mainv>:
  400620:       55                              push   %rbp
  400621:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400625:       ba e5 08 40 00                  mov    $0x4008e5,%edx
  40062a:       b8 20 00 00 00                  mov    $0x20,%eax
  40062f:       48 89 e5                        mov    %rsp,%rbp
  400632:       48 83 e4 e0                     and    $0xffffffffffffffe0,%rsp
  400636:       48 81 ec 00 03 00 00            sub    $0x300,%rsp
  40063d:       c5 fd 7f 44 24 60               vmovdqa %ymm0,0x60(%rsp)
  400643:       c5 fd 7f 44 24 40               vmovdqa %ymm0,0x40(%rsp)
  400649:       c5 fd 7f 44 24 20               vmovdqa %ymm0,0x20(%rsp)
  40064f:       c5 fd 7f 04 24                  vmovdqa %ymm0,(%rsp)
  400654:       c5 fd 7f 84 24 e0 00 00 00      vmovdqa %ymm0,0xe0(%rsp)
  40065d:       c5 fd 7f 84 24 c0 00 00 00      vmovdqa %ymm0,0xc0(%rsp)
  400666:       c5 fd 7f 84 24 a0 00 00 00      vmovdqa %ymm0,0xa0(%rsp)
  40066f:       c5 fd 7f 84 24 80 00 00 00      vmovdqa %ymm0,0x80(%rsp)
  400678:       0f 1f 84 00 00 00 00 00         nopl   0x0(%rax,%rax,1)
  400680:       48 83 c2 01                     add    $0x1,%rdx
  400684:       c6 04 04 01                     movb   $0x1,(%rsp,%rax,1)
  400688:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  40068c:       84 c0                           test   %al,%al
  40068e:       75 f0                           jne    400680 <_Z4Mainv+0x60>
  400690:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400694:       31 c0                           xor    %eax,%eax
  400696:       c5 fd 7f 84 24 60 01 00 00      vmovdqa %ymm0,0x160(%rsp)
  40069f:       c5 fd 7f 84 24 40 01 00 00      vmovdqa %ymm0,0x140(%rsp)
  4006a8:       c5 fd 7f 84 24 20 01 00 00      vmovdqa %ymm0,0x120(%rsp)
  4006b1:       c5 fd 7f 84 24 00 01 00 00      vmovdqa %ymm0,0x100(%rsp)
  4006ba:       c5 fd 7f 84 24 e0 01 00 00      vmovdqa %ymm0,0x1e0(%rsp)
  4006c3:       c5 fd 7f 84 24 c0 01 00 00      vmovdqa %ymm0,0x1c0(%rsp)
  4006cc:       c5 fd 7f 84 24 a0 01 00 00      vmovdqa %ymm0,0x1a0(%rsp)
  4006d5:       c5 fd 7f 84 24 80 01 00 00      vmovdqa %ymm0,0x180(%rsp)
  4006de:       c5 fd 6f 04 24                  vmovdqa (%rsp),%ymm0
  4006e3:       c5 fd 7f 84 24 00 02 00 00      vmovdqa %ymm0,0x200(%rsp)
  4006ec:       c5 fd 6f 44 24 20               vmovdqa 0x20(%rsp),%ymm0
  4006f2:       c5 fd 7f 84 24 20 02 00 00      vmovdqa %ymm0,0x220(%rsp)
  4006fb:       c5 fd 6f 44 24 40               vmovdqa 0x40(%rsp),%ymm0
  400701:       c5 fd 7f 84 24 40 02 00 00      vmovdqa %ymm0,0x240(%rsp)
  40070a:       c5 fd 6f 44 24 60               vmovdqa 0x60(%rsp),%ymm0
  400710:       c5 fd 7f 84 24 60 02 00 00      vmovdqa %ymm0,0x260(%rsp)
  400719:       c5 fd 6f 84 24 80 00 00 00      vmovdqa 0x80(%rsp),%ymm0
  400722:       c5 fd 7f 84 24 80 02 00 00      vmovdqa %ymm0,0x280(%rsp)
  40072b:       c5 fd 6f 84 24 a0 00 00 00      vmovdqa 0xa0(%rsp),%ymm0
  400734:       c5 fd 7f 84 24 a0 02 00 00      vmovdqa %ymm0,0x2a0(%rsp)
  40073d:       c5 fd 6f 84 24 c0 00 00 00      vmovdqa 0xc0(%rsp),%ymm0
  400746:       c5 fd 7f 84 24 c0 02 00 00      vmovdqa %ymm0,0x2c0(%rsp)
  40074f:       c5 fd 6f 84 24 e0 00 00 00      vmovdqa 0xe0(%rsp),%ymm0
  400758:       c5 fd 7f 84 24 e0 02 00 00      vmovdqa %ymm0,0x2e0(%rsp)
  400761:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)
  400768:       80 3c 04 00                     cmpb   $0x0,(%rsp,%rax,1)
  40076c:       0f b6 8c 04 00 02 00 00         movzbl 0x200(%rsp,%rax,1),%ecx
  400774:       ba 01 00 00 00                  mov    $0x1,%edx
  400779:       75 08                           jne    400783 <_Z4Mainv+0x163>
  40077b:       0f b6 94 04 00 01 00 00         movzbl 0x100(%rsp,%rax,1),%edx
  400783:       38 d1                           cmp    %dl,%cl
  400785:       0f 85 b2 00 00 00               jne    40083d <_Z4Mainv+0x21d>
  40078b:       48 83 c0 01                     add    $0x1,%rax
  40078f:       48 3d 00 01 00 00               cmp    $0x100,%rax
  400795:       75 d1                           jne    400768 <_Z4Mainv+0x148>
  400797:       c5 fd 6f 8c 24 00 02 00 00      vmovdqa 0x200(%rsp),%ymm1
  4007a0:       31 c0                           xor    %eax,%eax
  4007a2:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007a7:       0f 94 c0                        sete   %al
  4007aa:       0f 85 88 00 00 00               jne    400838 <_Z4Mainv+0x218>
  4007b0:       c5 fd 6f 8c 24 20 02 00 00      vmovdqa 0x220(%rsp),%ymm1
  4007b9:       31 c0                           xor    %eax,%eax
  4007bb:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007c0:       0f 94 c0                        sete   %al
  4007c3:       75 73                           jne    400838 <_Z4Mainv+0x218>
  4007c5:       c5 fd 6f 8c 24 40 02 00 00      vmovdqa 0x240(%rsp),%ymm1
  4007ce:       31 c0                           xor    %eax,%eax
  4007d0:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007d5:       0f 94 c0                        sete   %al
  4007d8:       75 5e                           jne    400838 <_Z4Mainv+0x218>
  4007da:       c5 fd 6f 8c 24 60 02 00 00      vmovdqa 0x260(%rsp),%ymm1
  4007e3:       31 c0                           xor    %eax,%eax
  4007e5:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007ea:       0f 94 c0                        sete   %al
  4007ed:       75 49                           jne    400838 <_Z4Mainv+0x218>
  4007ef:       c5 fd 6f 8c 24 80 02 00 00      vmovdqa 0x280(%rsp),%ymm1
  4007f8:       31 c0                           xor    %eax,%eax
  4007fa:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007ff:       0f 94 c0                        sete   %al
  400802:       75 34                           jne    400838 <_Z4Mainv+0x218>
  400804:       c5 fd 6f 8c 24 a0 02 00 00      vmovdqa 0x2a0(%rsp),%ymm1
  40080d:       31 c0                           xor    %eax,%eax
  40080f:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  400814:       0f 94 c0                        sete   %al
  400817:       75 1f                           jne    400838 <_Z4Mainv+0x218>
  400819:       c5 fd 6f 8c 24 c0 02 00 00      vmovdqa 0x2c0(%rsp),%ymm1
  400822:       31 c0                           xor    %eax,%eax
  400824:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  400829:       0f 94 c0                        sete   %al
  40082c:       75 0a                           jne    400838 <_Z4Mainv+0x218>
  40082e:       31 c0                           xor    %eax,%eax
  400830:       c4 e2 7d 17 c0                  vptest %ymm0,%ymm0
  400835:       0f 94 c0                        sete   %al
  400838:       c5 f8 77                        vzeroupper 
  40083b:       c9                              leaveq 
  40083c:       c3                              retq   
  40083d:       b9 20 09 40 00                  mov    $0x400920,%ecx
  400842:       ba 26 00 00 00                  mov    $0x26,%edx
  400847:       be e9 08 40 00                  mov    $0x4008e9,%esi
  40084c:       bf f8 08 40 00                  mov    $0x4008f8,%edi
  400851:       c5 f8 77                        vzeroupper 
  400854:       e8 97 fc ff ff                  callq  4004f0 <__assert_fail@plt>
  400859:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)

GCC 5:

0000000000400630 <_Z4Mainv>:
  400630:       4c 8d 54 24 08                  lea    0x8(%rsp),%r10
  400635:       48 83 e4 e0                     and    $0xffffffffffffffe0,%rsp
  400639:       b8 20 00 00 00                  mov    $0x20,%eax
  40063e:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400642:       ba 25 08 40 00                  mov    $0x400825,%edx
  400647:       41 ff 72 f8                     pushq  -0x8(%r10)
  40064b:       55                              push   %rbp
  40064c:       48 89 e5                        mov    %rsp,%rbp
  40064f:       41 52                           push   %r10
  400651:       48 81 ec 08 03 00 00            sub    $0x308,%rsp
  400658:       c5 fd 7f 85 50 fd ff ff         vmovdqa %ymm0,-0x2b0(%rbp)
  400660:       c5 fd 7f 85 30 fd ff ff         vmovdqa %ymm0,-0x2d0(%rbp)
  400668:       c5 fd 7f 85 10 fd ff ff         vmovdqa %ymm0,-0x2f0(%rbp)
  400670:       c5 fd 7f 85 f0 fc ff ff         vmovdqa %ymm0,-0x310(%rbp)
  400678:       c5 fd 7f 85 d0 fd ff ff         vmovdqa %ymm0,-0x230(%rbp)
  400680:       c5 fd 7f 85 b0 fd ff ff         vmovdqa %ymm0,-0x250(%rbp)
  400688:       c5 fd 7f 85 90 fd ff ff         vmovdqa %ymm0,-0x270(%rbp)
  400690:       c5 fd 7f 85 70 fd ff ff         vmovdqa %ymm0,-0x290(%rbp)
  400698:       0f 1f 84 00 00 00 00 00         nopl   0x0(%rax,%rax,1)
  4006a0:       48 83 c2 01                     add    $0x1,%rdx
  4006a4:       c6 84 05 f0 fc ff ff 01         movb   $0x1,-0x310(%rbp,%rax,1)
  4006ac:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  4006b0:       84 c0                           test   %al,%al
  4006b2:       75 ec                           jne    4006a0 <_Z4Mainv+0x70>
  4006b4:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  4006b8:       31 c0                           xor    %eax,%eax
  4006ba:       c5 fd 7f 85 50 fe ff ff         vmovdqa %ymm0,-0x1b0(%rbp)
  4006c2:       c5 fd 7f 85 30 fe ff ff         vmovdqa %ymm0,-0x1d0(%rbp)
  4006ca:       c5 fd 7f 85 10 fe ff ff         vmovdqa %ymm0,-0x1f0(%rbp)
  4006d2:       c5 fd 7f 85 f0 fd ff ff         vmovdqa %ymm0,-0x210(%rbp)
  4006da:       c5 fd 7f 85 d0 fe ff ff         vmovdqa %ymm0,-0x130(%rbp)
  4006e2:       c5 fd 7f 85 b0 fe ff ff         vmovdqa %ymm0,-0x150(%rbp)
  4006ea:       c5 fd 7f 85 90 fe ff ff         vmovdqa %ymm0,-0x170(%rbp)
  4006f2:       c5 fd 7f 85 70 fe ff ff         vmovdqa %ymm0,-0x190(%rbp)
  4006fa:       c5 fd 7f 85 f0 fe ff ff         vmovdqa %ymm0,-0x110(%rbp)
  400702:       c5 fd 7f 85 10 ff ff ff         vmovdqa %ymm0,-0xf0(%rbp)
  40070a:       c5 fd 7f 85 30 ff ff ff         vmovdqa %ymm0,-0xd0(%rbp)
  400712:       c5 fd 7f 85 50 ff ff ff         vmovdqa %ymm0,-0xb0(%rbp)
  40071a:       c5 fd 7f 85 70 ff ff ff         vmovdqa %ymm0,-0x90(%rbp)
  400722:       c5 fd 7f 45 90                  vmovdqa %ymm0,-0x70(%rbp)
  400727:       c5 fd 7f 45 b0                  vmovdqa %ymm0,-0x50(%rbp)
  40072c:       c5 fd 7f 45 d0                  vmovdqa %ymm0,-0x30(%rbp)
  400731:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)
  400738:       0f b6 94 05 f0 fc ff ff         movzbl -0x310(%rbp,%rax,1),%edx
  400740:       0f b6 8c 05 f0 fe ff ff         movzbl -0x110(%rbp,%rax,1),%ecx
  400748:       84 d2                           test   %dl,%dl
  40074a:       75 08                           jne    400754 <_Z4Mainv+0x124>
  40074c:       0f b6 94 05 f0 fd ff ff         movzbl -0x210(%rbp,%rax,1),%edx
  400754:       38 d1                           cmp    %dl,%cl
  400756:       75 2c                           jne    400784 <_Z4Mainv+0x154>
  400758:       48 83 c0 01                     add    $0x1,%rax
  40075c:       48 3d 00 01 00 00               cmp    $0x100,%rax
  400762:       75 d4                           jne    400738 <_Z4Mainv+0x108>
  400764:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400768:       31 c0                           xor    %eax,%eax
  40076a:       c4 e2 7d 17 c0                  vptest %ymm0,%ymm0
  40076f:       0f 94 c0                        sete   %al
  400772:       c5 f8 77                        vzeroupper 
  400775:       48 81 c4 08 03 00 00            add    $0x308,%rsp
  40077c:       41 5a                           pop    %r10
  40077e:       5d                              pop    %rbp
  40077f:       49 8d 62 f8                     lea    -0x8(%r10),%rsp
  400783:       c3                              retq   
  400784:       b9 60 08 40 00                  mov    $0x400860,%ecx
  400789:       ba 26 00 00 00                  mov    $0x26,%edx
  40078e:       be 29 08 40 00                  mov    $0x400829,%esi
  400793:       bf 38 08 40 00                  mov    $0x400838,%edi
  400798:       c5 f8 77                        vzeroupper 
  40079b:       e8 50 fd ff ff                  callq  4004f0 <__assert_fail@plt>
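Comparing the two listings: in the passing build (the listing above "GCC 5:"), the result that the assertion reads at 0x200(%rsp) is filled by copying the 256 bytes at (%rsp), where the set was built byte by byte. In the GCC 5 listing, the result at -0x110(%rbp) is instead filled with zero vectors (the vpxor/vmovdqa run from 0x4006b4 to 0x40072c), and operator! reduces to a vptest of a zeroed register. GCC 5 therefore appears to compute operator| from the zero-initialized vectors only, as if the byte stores made through the bool* cast had never happened. The listing suggests that the cast is what trips the optimizer but does not prove it. Assuming it is, one experiment is to keep the same 256-byte __m256i layout but do single-byte access through unsigned char, which C++ allows to alias any object. The class name CharSet and its members test, set and empty below are hypothetical, not from the question:

```cpp
// Hypothetical sketch: same storage as T, but single bytes are read and
// written through unsigned char (allowed to alias any object) instead of
// through a bool* cast of the __m256i array.
// Build: g++ -std=c++11 -mavx2 -O3 charset_sketch.cpp
#pragma GCC target("avx2")  // GCC only: enables AVX2 codegen even without -mavx2
#include <immintrin.h>

class CharSet {
public:
  __m256i _bits[8];  // 8 x 32 bytes = one byte per possible char value

  CharSet() {
    for (int i = 0; i < 8; i++)
      _bits[i] = _mm256_setzero_si256();
  }

  explicit CharSet(char const* s) : CharSet() {
    unsigned char c;
    while ((c = static_cast<unsigned char>(*s++)))
      set(c);
  }

  // Element access goes through unsigned char, never through a bool lvalue.
  bool test(unsigned char c) const {
    return reinterpret_cast<unsigned char const*>(_bits)[c] != 0;
  }
  void set(unsigned char c) {
    reinterpret_cast<unsigned char*>(_bits)[c] = 1;
  }

  // Whole-set operations still use the AVX2 intrinsics from the question.
  CharSet operator|(CharSet const& b) const {
    CharSet res;
    for (int i = 0; i < 8; i++)
      res._bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);
    return res;
  }

  // Same check as the question's operator!: true if no byte is set.
  bool empty() const {
    for (int i = 0; i < 8; i++)
      if (!_mm256_testz_si256(_bits[i], _bits[i]))
        return false;
    return true;
  }
};
```

Because the accessors no longer hand out a bool&, code that wrote `(*this)[c] = true` would call `set(c)` instead. Whether this change alone makes the `-O3` assertion pass on GCC 5.x and 6.1 is not established by the listings above; it only removes the bool* type pun as a variable.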
