_mm256_load_ps causes segmentation fault with google/benchmark in debug mode
Problem description
- The following code runs fine in both release and debug mode:
#include <immintrin.h>

constexpr int n_batch = 10240;
constexpr int n = n_batch * 8;

#pragma pack(32)
float a[n];
float b[n];
float c[n];
#pragma pack()

int main() {
    for(int i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
    for(int i = 0; i < n; i += 4) {
        __m128 av = _mm_load_ps(a + i);
        __m128 bv = _mm_load_ps(b + i);
        __m128 cv = _mm_mul_ps(av, bv);
        _mm_store_ps(c + i, cv);
    }
    for(int i = 0; i < n; i += 8) {
        __m256 av = _mm256_load_ps(a + i);
        __m256 bv = _mm256_load_ps(b + i);
        __m256 cv = _mm256_mul_ps(av, bv);
        _mm256_store_ps(c + i, cv);
    }
}
- The following code runs only in release mode; in debug mode it crashes with a segmentation fault:
#include <immintrin.h>
#include "benchmark/benchmark.h"

constexpr int n_batch = 10240;
constexpr int n = n_batch * 8;

#pragma pack(32)
float a[n];
float b[n];
float c[n];
#pragma pack()

static void BM_Scalar(benchmark::State &state) {
    for(auto _: state)
        for(int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
}
BENCHMARK(BM_Scalar);

static void BM_Packet_4(benchmark::State &state) {
    for(auto _: state) {
        for(int i = 0; i < n; i += 4) {
            __m128 av = _mm_load_ps(a + i);
            __m128 bv = _mm_load_ps(b + i);
            __m128 cv = _mm_mul_ps(av, bv);
            _mm_store_ps(c + i, cv);
        }
    }
}
BENCHMARK(BM_Packet_4);

static void BM_Packet_8(benchmark::State &state) {
    for(auto _: state) {
        for(int i = 0; i < n; i += 8) {
            __m256 av = _mm256_load_ps(a + i); // Signal: SIGSEGV (signal SIGSEGV: invalid address (fault address: 0x0))
            __m256 bv = _mm256_load_ps(b + i);
            __m256 cv = _mm256_mul_ps(av, bv);
            _mm256_store_ps(c + i, cv);
        }
    }
}
BENCHMARK(BM_Packet_8);

BENCHMARK_MAIN();
Answer
Your arrays aren't aligned by 32. You could check this with a debugger.

#pragma pack(32)

only aligns struct/union/class members, as documented by Microsoft. C++ arrays are a different kind of object and aren't affected at all by that MSVC pragma. (You're probably actually using GCC's or clang's version of it, though, because MSVC-generated code generally uses vmovups, not vmovaps.)

For arrays in static or automatic storage (not dynamically allocated), the easiest way to align them in C++11 and later is alignas(32). That's fully portable, unlike GNU C __attribute__((aligned(32))) or whatever MSVC's equivalent is.
alignas(32) float a[n];
alignas(32) float b[n];
alignas(32) float c[n];
The Q&A "AVX: data alignment: store crash, storeu, load, loadu" explains why there's a difference depending on optimization level: optimized code will fold a load into a memory source operand for vmulps, which (unlike SSE) doesn't require alignment. (Presumably the first array happens to be sufficiently aligned.)

Un-optimized code will do the _mm256_load_ps separately, with an alignment-required vmovaps load.

(_mm256_loadu_ps will always avoid using alignment-required loads, so use that if you can't guarantee your data is aligned.)