Is the _mm256_store_ps() function atomic when used alongside OpenMP?


Problem description

I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition, using OpenMP alongside it. But it gets a segmentation fault at the call to _mm256_store_ps().

I have tried OpenMP constructs such as atomic and critical, in case the problem was that this function is not atomic in nature and multiple cores were attempting to execute it at the same time, but that did not help.
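For reference, a minimal sketch of what such an attempt presumably looked like (the exact pragma placement is my assumption; it is not shown in the original post). Note that #pragma omp atomic only accepts simple scalar updates, so it cannot wrap a vector intrinsic at all, and a critical section merely serializes the loop body:

/* Hypothetical reconstruction of the attempted workaround. Serializing
 * the store cannot fix the crash: each iteration writes a distinct
 * 32-byte block of d, so there is no data race to begin with -- the
 * fault comes from a misaligned address, as the answer below explains. */
#pragma omp critical
_mm256_store_ps(&d[i], d_intel);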

#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#include<immintrin.h>
#include<omp.h>
#define N 64

__m256 multiply_and_add_intel(__m256 a, __m256 b, __m256 c) {
  return _mm256_add_ps(_mm256_mul_ps(a, b),c);
}

void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
  __m256 a_intel, b_intel, c_intel, d_intel;
  #pragma omp parallel for private(a_intel,b_intel,c_intel,d_intel)
  for(long i=0; i<N; i=i+8) {
    a_intel = _mm256_loadu_ps(&a[i]);
    b_intel = _mm256_loadu_ps(&b[i]);
    c_intel = _mm256_loadu_ps(&c[i]);
    d_intel = multiply_and_add_intel(a_intel, b_intel, c_intel);
    _mm256_store_ps(&d[i],d_intel);
  }
}
int main()
{
    srand(time(NULL));
    float * a = (float *) malloc(sizeof(float) * N);
    float * b = (float *) malloc(sizeof(float) * N);
    float * c = (float *) malloc(sizeof(float) * N);
    float * d_intel_avx_omp = (float *)malloc(sizeof(float) * N);
    int i;
    for(i=0;i<N;i++)
    {
        a[i] = (float)(rand()%10);
        b[i] = (float)(rand()%10);
        c[i] = (float)(rand()%10);
    }
    double time_t = omp_get_wtime();
    multiply_and_add_intel_total_omp(a,b,c,d_intel_avx_omp);
    time_t = omp_get_wtime() - time_t;
    printf("\nTime taken to calculate with AVX2 and OMP : %0.5lf\n",time_t);
    free(a);
    free(b);
    free(c);
    free(d_intel_avx_omp);
    return 0;
}

I expect to get d = a * b + c, but it shows a segmentation fault. I have tried to perform the same task without OpenMP and it works without errors. Please let me know if there is any compatibility issue or I am missing something.

  • gcc version 7.3.0
  • Intel® Core™ i3-3110M processor
  • OS: Ubuntu 18.04
  • OpenMP 4.5; I executed the command $ echo | cpp -fopenmp -dM | grep -i open and it showed #define _OPENMP 201511
  • Compile command: gcc first_int.c -mavx -fopenmp

** UPDATE **

As per the discussions and suggestions, the new code is:

 float * a = (float *) aligned_alloc(N, sizeof(float) * N);
 float * b = (float *) aligned_alloc(N, sizeof(float) * N);
 float * c = (float *) aligned_alloc(N, sizeof(float) * N);
 float * d_intel_avx_omp = (float *)aligned_alloc(N, sizeof(float) * N);

It is not working.
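A likely reason this update still fails (my reading; it is not spelled out in the answer below): the first argument of aligned_alloc is the alignment, not the element count. Passing N is only valid when N is a power of two, and C11 additionally requires the size to be a multiple of the alignment; with N = 50000 the call is invalid and may return NULL. For 256-bit AVX, a 32-byte alignment is what is needed:

/* 32-byte alignment matches __m256; sizeof(float) * N is a multiple
 * of 32 as long as N is a multiple of 8. */
float * a = (float *) aligned_alloc(32, sizeof(float) * N);
float * b = (float *) aligned_alloc(32, sizeof(float) * N);
float * c = (float *) aligned_alloc(32, sizeof(float) * N);
float * d_intel_avx_omp = (float *) aligned_alloc(32, sizeof(float) * N);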

Just a note: I was trying to compare plain calculation, AVX calculation, and AVX+OpenMP calculation. This is the result I got:

  • Time taken to calculate without AVX : 0.00037 s
  • Time taken to calculate with AVX : 0.00024 s
  • Time taken to calculate with AVX and OMP : 0.00019 s

N = 50000

Answer

The documentation for _mm256_store_ps says:

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

You can use _mm256_storeu_ps instead for unaligned stores.
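Applied to the loop in the question, that is a one-line change (a sketch; it trades the crash for a potentially slower unaligned store):

/* _mm256_storeu_ps has no alignment requirement, so it works with
 * plain malloc'ed memory. */
_mm256_storeu_ps(&d[i], d_intel);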

A better option is to align all your arrays on a 32-byte boundary (for 256-bit AVX registers) and use aligned loads and stores for maximum performance, because unaligned loads/stores that cross a cache-line boundary incur a performance penalty.

Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:

float* allocate_aligned(size_t n) {
    // alignof(__m256) is 32; aligned_alloc also requires the total size
    // to be a multiple of the alignment, i.e. n a multiple of 8 floats.
    constexpr size_t alignment = alignof(__m256);
    return static_cast<float*>(aligned_alloc(alignment, sizeof(float) * n));
}
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);
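With the arrays 32-byte aligned, the C function from the question can then use aligned loads and the aligned store safely; a minimal sketch under that assumption (N a multiple of 8):

void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
  /* Each iteration touches a distinct 32-byte block of d, so no atomic
   * or critical construct is needed. */
  #pragma omp parallel for
  for(long i = 0; i < N; i += 8) {
    __m256 a_intel = _mm256_load_ps(&a[i]);  /* aligned loads are valid now */
    __m256 b_intel = _mm256_load_ps(&b[i]);
    __m256 c_intel = _mm256_load_ps(&c[i]);
    _mm256_store_ps(&d[i], _mm256_add_ps(_mm256_mul_ps(a_intel, b_intel), c_intel));
  }
}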

In C++17, new can allocate with alignment:

float* allocate_aligned(size_t n) {
    // Aligned array new; the matching cleanup is
    // operator delete[](p, alignment), not a plain delete[].
    constexpr auto alignment = std::align_val_t{alignof(__m256)};
    return new(alignment) float[n];
}


Alternatively, use Vc: portable, zero-overhead C++ types for explicitly data-parallel programming, which aligns heap-allocated SIMD vectors for you:

#include <cstdio>
#include <cstdlib>   // std::rand
#include <memory>
#include <chrono>
#include <Vc/Vc>

Vc::float_v random_float_v() {
    alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
    for(unsigned i = 0; i < Vc::float_v::Size; ++i)
        t[i] = std::rand() % 10;
    return Vc::float_v(t, Vc::Aligned);
}

// CRC over the results so the compiler cannot optimize them away;
// __builtin_ia32_crc32si requires compiling with -msse4.2.
unsigned reverse_crc32(void const* vbegin, void const* vend) {
    unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
    unsigned const* end = reinterpret_cast<unsigned const*>(vend);
    unsigned r = 0;
    while(begin != end)
        r = __builtin_ia32_crc32si(r, *--end);
    return r;
}

int main() {
    constexpr size_t N = 65536;
    constexpr size_t M = N / Vc::float_v::Size;

    std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);

    for(unsigned i = 0; i < M; ++i) {
        a[i] = random_float_v();
        b[i] = random_float_v();
        c[i] = random_float_v();
    }

    auto t0 = std::chrono::high_resolution_clock::now();
    for(unsigned i = 0; i < M; ++i)
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
    unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
    std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}

Parallel version:

#include <tbb/parallel_for.h>
// ...
    auto t0 = std::chrono::high_resolution_clock::now();
    tbb::parallel_for(size_t{0}, M, [&](size_t i) {
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    });
    auto t1 = std::chrono::high_resolution_clock::now();
