Atomic operators, SSE/AVX, and OpenMP

Question

I'm wondering whether SSE/AVX operations such as addition and multiplication can be atomic. The reason I ask is that in OpenMP the atomic construct only works with a limited set of operators. It does not work with, for example, SSE/AVX additions.

Let's assume I have a datatype float4 that corresponds to an SSE register, and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code:

float4 sum4 = 0.0f; //sets all four values to zero
#pragma omp parallel
{
    float4 sum_private = 0.0f;
    #pragma omp for nowait
    for(int i=0; i<N; i+=4) {
        float4 val = float4().load(&array[i]); //load four floats into a SSE register
        sum_private += val; //sum_private = _mm_add_ps(val,sum_private)
    }
    #pragma omp critical
    sum4 += sum_private;
}
float sum = horizontal_sum(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]

But atomic is generally faster than critical, and my instinct tells me SSE/AVX operations should be atomic (even if OpenMP does not support it). Is this a limitation of OpenMP? Could I use, for example, Intel Threading Building Blocks or pthreads to do this as an atomic operation?
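One way to stay within what the atomic construct supports is to reduce each thread's private vector to a scalar first and apply `#pragma omp atomic` to the scalar update only. The following is a minimal sketch under assumptions: `sum_with_atomic` is a hypothetical name, and a plain `float lanes[4]` array stands in for the private SSE register, so this illustrates the structure rather than real intrinsics.

```cpp
#include <cassert>

// Hypothetical helper: sums n floats (n assumed to be a multiple of 4).
float sum_with_atomic(const float* a, int n) {
    assert(n % 4 == 0);
    float sum = 0.0f;
    #pragma omp parallel
    {
        // Stand-in for a thread-private 4-lane SSE register.
        float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        #pragma omp for nowait
        for (int i = 0; i < n; i += 4) {
            for (int k = 0; k < 4; ++k) lanes[k] += a[i + k];
        }
        // Horizontal sum gives a scalar, which the atomic construct accepts.
        float partial = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        #pragma omp atomic
        sum += partial;
    }
    return sum;
}
```

Each thread performs one atomic update in total, so the cost of the atomic is paid once per thread rather than once per element.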

Based on Jim Cownie's comment I created a new function, which is the best solution. I verified that it gives the correct result.

float sum = 0.0f;
#pragma omp parallel reduction(+:sum)
{
    Vec4f sum4 = 0.0f;  
    #pragma omp for nowait
    for(int i=0; i<N; i+=4) {
        Vec4f val = Vec4f().load(&A[i]); //load four floats into a SSE register
        sum4 += val; //sum4 = _mm_add_ps(val,sum4)
    }
    sum += horizontal_add(sum4);
}

Based on comments by Jim Cownie and by Mystical in the thread "OpenMP atomic _mm_add_pd", I now realize that the reduction implementation in OpenMP does not necessarily use atomic operators, and that it's best to rely on OpenMP's reduction implementation rather than try to do it with atomic.
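Relying on OpenMP's own reduction can also be extended to the vector type itself through a user-defined reduction (`declare reduction`, available since OpenMP 4.0). The sketch below is an assumption-laden illustration: `Float4` is a hypothetical portable stand-in for a type like Vec4f (a real version would wrap `__m128`), and `sum_array` is an invented name.

```cpp
#include <cassert>

// Hypothetical stand-in for a 4-lane SIMD type such as Vec4f.
struct Float4 {
    float v[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    Float4& operator+=(const Float4& o) {
        for (int k = 0; k < 4; ++k) v[k] += o.v[k];
        return *this;
    }
    static Float4 load(const float* p) {
        Float4 r;
        for (int k = 0; k < 4; ++k) r.v[k] = p[k];
        return r;
    }
};

float horizontal_add(const Float4& a) {
    return a.v[0] + a.v[1] + a.v[2] + a.v[3];
}

// Tell OpenMP how to merge two private copies and how to initialize each one.
#pragma omp declare reduction(+ : Float4 : omp_out += omp_in) \
    initializer(omp_priv = Float4())

// Sums n floats (n assumed to be a multiple of 4).
float sum_array(const float* a, int n) {
    assert(n % 4 == 0);
    Float4 sum4;
    #pragma omp parallel for reduction(+ : sum4)
    for (int i = 0; i < n; i += 4) {
        sum4 += Float4::load(&a[i]);
    }
    return horizontal_add(sum4);
}
```

How the private copies are merged is left to the OpenMP runtime, which is the point made above: the reduction need not be implemented with atomics at all.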

Recommended answer

SSE and AVX operations are in general not atomic (but multiword CAS would sure be sweet).

You can use the combinable class template in tbb or ppl for more general-purpose reductions and thread-local initializations. Think of it as a synchronized hash table indexed by thread id; it works just fine with OpenMP and doesn't spin up any extra threads on its own.

You can find examples on the tbb site and on msdn.
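To make the "table indexed by thread id" picture concrete, here is a minimal sketch of the idea using only OpenMP and the standard library. It is not the tbb or ppl API: `Slot` and `sum_with_slots` are hypothetical names, and the fixed-size vector replaces the synchronized hash table a real combinable would use.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// One accumulator per thread, padded to 64 bytes to limit false sharing.
struct alignas(64) Slot {
    float value = 0.0f;
};

// Hypothetical helper: sums n floats through per-thread slots.
float sum_with_slots(const float* a, int n) {
    assert(n >= 0);
    std::vector<Slot> slots(omp_get_max_threads());
    #pragma omp parallel
    {
        Slot& mine = slots[omp_get_thread_num()];  // this thread's own slot
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            mine.value += a[i];
        }
    }
    // Combine step: runs on one thread after the parallel region has ended.
    float total = 0.0f;
    for (const Slot& s : slots) total += s.value;
    return total;
}
```

No thread ever writes to another thread's slot, so the accumulation needs neither atomic nor critical; the only synchronization is the implicit barrier at the end of the parallel region.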

Regarding the comment, consider this code:

x = x + 5

You should really think of it as the following, particularly when multiple threads are involved:

while( true ){
    oldValue = x
    desiredValue = oldValue + 5
    //this conditional is the atomic compare and swap:
    //the comparison and the store happen as one indivisible step
    if( x == oldValue ){
       x = desiredValue
       break;
    }
}

Does that make sense?
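In C++ the retry loop above maps fairly directly onto `std::atomic` and its `compare_exchange_weak` member. The sketch below assumes a `float` value; `atomic_add` and `check_atomic_add` are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Adds delta to x with an explicit compare-and-swap retry loop
// and returns the value that was stored.
float atomic_add(std::atomic<float>& x, float delta) {
    float oldValue = x.load();
    // compare_exchange_weak stores oldValue + delta only if x still holds
    // oldValue. On failure it writes the current value of x into oldValue,
    // so the next iteration recomputes the desired value from fresh data.
    while (!x.compare_exchange_weak(oldValue, oldValue + delta)) {
    }
    return oldValue + delta;
}

// Small usage check: x = x + 5, starting from 1.
void check_atomic_add() {
    std::atomic<float> x{1.0f};
    float result = atomic_add(x, 5.0f);
    assert(result == 6.0f);
    assert(x.load() == 6.0f);
}
```

The same pattern only works when the whole value fits in what the hardware can compare and swap in one step, which is why a full SSE or AVX register is a harder case than a single float.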
