Atomic Minimum on x86 using OpenMP

Question

Does OpenMP support an atomic minimum for C++11? If OpenMP has no portable method: is there some way of doing it using an x86 or amd64 feature?

In the OpenMP specifications I found nothing for C++, but the Fortran version seems to support it. See section 2.8.5 of the v3.1 specification for the details. For C++ it states

binop is one of +, *, -, /, &, ^, |, <<, or >>.

but for Fortran it states

intrinsic_procedure_name is one of MAX, MIN, IAND, IOR, or IEOR.

In case you are interested in more context: I am looking for a mutex-free method of doing the following:

vector<omp_lock_t> lock;  // one lock per element of val
vector<int> val;

#pragma omp parallel
{
  // ...
  int x = ...;
  int y = ...;
  if (y < val[x]) {          // cheap unlocked pre-check
    omp_set_lock(&lock[x]);
    if (y < val[x])          // re-check while holding the lock
      val[x] = y;
    omp_unset_lock(&lock[x]);
  }
}

I know that you can compute the minimum using a reduce algorithm. I know that there are circumstances where this largely outperforms any atomic minimum approach. However, I also know that this is not the case in my situation.

One option that is slightly faster in my case is

  int x = ...;
  int y = ...;
  while(y < val[x])
    val[x] = y;

But this is not atomic.

All the newer GPUs have this feature and I am missing it on the CPU. (See atom_min for OpenCL.)

Recommended answer

The OpenMP specification for C++ does not have support for atomic minimum. Neither does C++11.
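If you do want an atomic minimum anyway, the usual workaround in C++11 is a compare-and-swap loop on a std::atomic<int>. The following is only a minimal sketch: atomic_min and min_of are hypothetical helper names (neither OpenMP nor the standard library provides them), and it assumes your compiler lets std::atomic interoperate with OpenMP threads, which mainstream implementations do in practice but older OpenMP specifications do not guarantee. On x86 a loop like this typically compiles down to a lock cmpxchg retry loop.

```cpp
#include <atomic>
#include <climits>
#include <vector>

// Hypothetical helper (not part of OpenMP or the standard library):
// atomically replaces target with min(target, candidate).
inline void atomic_min(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `current` with the latest
    // value, so the loop stops as soon as the stored value is no longer
    // larger than candidate, or as soon as our store succeeds.
    while (candidate < current &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_relaxed)) {
    }
}

// Hypothetical usage: smallest element of `values`, computed with
// concurrent atomic_min calls. Without OpenMP enabled the pragma is
// ignored and the loop simply runs serially.
int min_of(const std::vector<int>& values) {
    std::atomic<int> result(INT_MAX);
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(values.size()); ++i)
        atomic_min(result, values[i]);
    return result.load();
}
```

Relaxed ordering is used here on the assumption that only the minimum itself matters; if other data is published together with the value, a stronger memory order would be needed.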

I am assuming that in your algorithm, x can compute to any valid index, regardless of thread. I would suggest changing your algorithm so that each thread uses its own val array, followed by a final reconciliation at the end, which can also be parallelized by index. This avoids locks and atomics completely and gives you the benefit of separating the data for each thread, i.e. no chance of false sharing of cache lines. In other words, it should be faster.
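A rough sketch of that idea might look like the following. The function name min_per_index and the updates vector of (x, y) pairs are assumptions made for illustration; they stand in for however your loop actually produces x and y. The _OPENMP guards only exist so that the sketch also builds and runs serially when OpenMP is disabled.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: for every (x, y) in `updates`, keep the smallest y
// seen for index x. Indices that never occur keep INT_MAX.
std::vector<int> min_per_index(
        std::size_t n,
        const std::vector<std::pair<std::size_t, int>>& updates) {
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;
#endif
    // One private copy of val per thread, so no locks or atomics are needed.
    std::vector<std::vector<int>> local(
        max_threads, std::vector<int>(n, INT_MAX));

    #pragma omp parallel
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        std::vector<int>& mine = local[tid];
        #pragma omp for
        for (long i = 0; i < static_cast<long>(updates.size()); ++i) {
            const std::size_t x = updates[i].first;
            const int y = updates[i].second;
            mine[x] = std::min(mine[x], y);
        }
    }

    // Final reconciliation, parallelized by index: each x is written by
    // exactly one thread, which reads every thread's private copy.
    std::vector<int> val(n, INT_MAX);
    #pragma omp parallel for
    for (long x = 0; x < static_cast<long>(n); ++x)
        for (int t = 0; t < max_threads; ++t)
            val[x] = std::min(val[x], local[t][x]);
    return val;
}
```

The trade-off is memory: this keeps one full copy of val per thread, so it fits best when the array is small relative to the number of updates.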
