Is this incorrect code generation with arrays of __m256 values a clang bug?

Views: 791 Posted: 2016/11/22 22:48:03 Tags: c++ clang compiler-optimization avx2


Problem description

I'm encountering what appears to be a bug that causes incorrect code generation with clang 3.4, 3.5, and the 3.6 trunk. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to this self-contained example:

#include <iostream>
#include <immintrin.h>
#include <stdint.h>   // int8_t
#include <string.h>   // memcmp, size_t

struct simd_pack
{
    enum { num_vectors = 1 };
    __m256i _val[num_vectors];
};

simd_pack load_broken(int8_t *p)
{
    simd_pack pack;
    for (int i = 0; i < simd_pack::num_vectors; ++i) pack._val[i] = _mm256_loadu_si256(reinterpret_cast<__m256i *>(p + i * 32));
    return pack;
}

void store_broken(int8_t *p, simd_pack pack)
{
    for (int i = 0; i < simd_pack::num_vectors; ++i) _mm256_storeu_si256(reinterpret_cast<__m256i *>(p + i * 32), pack._val[i]);    
}

void test_broken(int8_t *out, int8_t *in1, size_t n)
{
    size_t i = 0;
    for (; i + 31 < n; i += 32)
    {
        simd_pack p1 = load_broken(in1 + i);
        store_broken(out + i, p1);
    }   
}

int main()
{
    int8_t in_buf[256];
    int8_t out_buf[256];
    for (size_t i = 0; i < 256; ++i) in_buf[i] = i;

    test_broken(out_buf, in_buf, 256);
    if (memcmp(in_buf, out_buf, 256)) std::cout << "test_broken() failed!" << std::endl;    

    return 0;
}

A summary of the above: I have a simple type called simd_pack that contains one member, an array holding a single __m256i value. In my application, there are operators and functions that take these types, but the problem can be illustrated by the example above. Specifically, test_broken() should read from the in1 array and simply copy its values to the out array. Therefore, the call to memcmp() in main() should return zero. I compile the above using the following:

clang++-3.6 bug_test.cc -o bug_test -mavx -O3

I find that at optimization levels -O0 and -O1 the test passes, while at -O2 and -O3 it fails. I've tried compiling the same file with gcc 4.4, 4.6, 4.7, and 4.8, as well as Intel C++ 13.0, and the test passes at all optimization levels.
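Since memcmp() only reports that the two buffers differ, a small helper that finds the first differing byte can help narrow down which part of each 32-byte block is wrong. The following is a minimal sketch and not part of the original program; first_mismatch is a hypothetical helper name, and it relies only on std::mismatch from the standard library:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not in the original program): returns the index of the
// first byte at which the two buffers differ, or n if they are identical.
std::size_t first_mismatch(const std::int8_t* a, const std::int8_t* b, std::size_t n)
{
    auto result = std::mismatch(a, a + n, b);
    return static_cast<std::size_t>(result.first - a);
}
```

Calling first_mismatch(in_buf, out_buf, 256) after test_broken() would return 256 on a passing run. A result whose offset within a 32-byte block falls in the range 16 to 31 would be consistent with only the upper half of a vector being wrong.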

Taking a closer look at the generated code, here's the assembly generated at optimization level -O3:

0000000000400a40 <test_broken(signed char*, signed char*, unsigned long)>:
  400a40:       55                      push   %rbp
  400a41:       48 89 e5                mov    %rsp,%rbp
  400a44:       48 81 e4 e0 ff ff ff    and    $0xffffffffffffffe0,%rsp
  400a4b:       48 83 ec 40             sub    $0x40,%rsp
  400a4f:       48 83 fa 20             cmp    $0x20,%rdx
  400a53:       72 2f                   jb     400a84 <test_broken(signed char*, signed char*, unsigned long)+0x44>
  400a55:       31 c0                   xor    %eax,%eax
  400a57:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  400a5e:       00 00 
  400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
  400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
  400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
  400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)
  400a74:       48 8d 48 20             lea    0x20(%rax),%rcx
  400a78:       48 83 c0 3f             add    $0x3f,%rax
  400a7c:       48 39 d0                cmp    %rdx,%rax
  400a7f:       48 89 c8                mov    %rcx,%rax
  400a82:       72 dc                   jb     400a60 <test_broken(signed char*, signed char*, unsigned long)+0x20>
  400a84:       48 89 ec                mov    %rbp,%rsp
  400a87:       5d                      pop    %rbp
  400a88:       c5 f8 77                vzeroupper 
  400a8b:       c3                      retq   
  400a8c:       0f 1f 40 00             nopl   0x0(%rax)

I'll reproduce the key part for emphasis:

  400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
  400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
  400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
  400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)

This is kind of head-scratching. It first loads 256 bits into ymm0 using the unaligned move that I asked for, then stores xmm0 (which contains only the lower 128 bits of the data that was read) to the stack, then immediately reads 256 bits back into ymm0 from the stack location that was just written. The effect is that ymm0's upper 128 bits (which get written to the output buffer) are garbage, causing the test to fail.
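To make the effect of those four instructions concrete, here is a portable model of the sequence that uses plain byte copies instead of AVX registers. It is only a sketch: copy_block_as_miscompiled and its filler parameter are hypothetical names, and the fixed filler byte is an assumption standing in for whatever indeterminate data occupies the upper half of the stack slot in the real program.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable model of the four-instruction sequence; no AVX required.
// `filler` stands in for the bytes already sitting in the upper half of the
// stack slot. In the real program those bytes are indeterminate; a fixed value
// is an assumption made here so that the behaviour is repeatable.
void copy_block_as_miscompiled(std::uint8_t* out, const std::uint8_t* in, std::uint8_t filler)
{
    std::uint8_t ymm0[32];        // models the 256-bit register
    std::uint8_t stack_slot[32];  // models the 32 bytes at (%rsp)
    std::memset(stack_slot, filler, sizeof stack_slot);

    std::memcpy(ymm0, in, 32);          // vmovups (%rsi,%rax,1),%ymm0 : load 32 bytes
    std::memcpy(stack_slot, ymm0, 16);  // vmovaps %xmm0,(%rsp)        : spill only the low 16 bytes
    std::memcpy(ymm0, stack_slot, 32);  // vmovaps (%rsp),%ymm0        : reload all 32 bytes
    std::memcpy(out, ymm0, 32);         // vmovups %ymm0,(%rdi,%rax,1) : store 32 bytes
}
```

In this model the first 16 output bytes match the input and the last 16 hold the filler, which mirrors the pattern described above: the low half of each vector survives and the high half does not.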

Is there some good reason why this could be happening, other than just a compiler bug? Am I violating some rule by having the simd_pack type hold an array of __m256i values? It certainly seems to be related to that; if I change _val to be a single value instead of an array, then the generated code works as intended. However, my application requires _val to be an array (its length depends on a C++ template parameter).
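For context, the template-parameterized shape mentioned above could look roughly like the sketch below. It is an illustration under assumptions rather than the actual application code: vec256 is a hypothetical stand-in for __m256i so that the example builds without AVX flags, and simd_pack_n, load_n, and store_n are made-up names. The real code would hold __m256i and use _mm256_loadu_si256 and _mm256_storeu_si256 as in the reduced example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for __m256i so the sketch builds without AVX flags.
struct alignas(32) vec256
{
    std::uint8_t bytes[32];
};

// Pack whose vector count comes from a template parameter.
template <std::size_t N>
struct simd_pack_n
{
    static constexpr std::size_t num_vectors = N;
    vec256 _val[N];
};

// Loads N consecutive 32-byte blocks starting at p.
template <std::size_t N>
simd_pack_n<N> load_n(const std::int8_t* p)
{
    simd_pack_n<N> pack;
    for (std::size_t i = 0; i < N; ++i)
        std::memcpy(pack._val[i].bytes, p + i * 32, 32);  // unaligned 32-byte load
    return pack;
}

// Stores N consecutive 32-byte blocks starting at p.
template <std::size_t N>
void store_n(std::int8_t* p, const simd_pack_n<N>& pack)
{
    for (std::size_t i = 0; i < N; ++i)
        std::memcpy(p + i * 32, pack._val[i].bytes, 32);  // unaligned 32-byte store
}
```

Note that store_n takes the pack by const reference, which is a choice made for this sketch; the reduced example passes it by value, and nothing here is meant to suggest that either form avoids the problem.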

Any ideas?
