Atomic test-and-set in x86: inline asm or compiler-generated lock bts?


Question

The code below, when compiled for a Xeon Phi, fails with: Error: cmovc is not supported on k1om.

It does compile properly for a regular Xeon processor, though.

#include<stdio.h>
int main()
{
    int in=5;
    int bit=1;
    int x=0, y=1;
    int& inRef = in;
    printf("in=%d\n",in);
    asm("lock bts %2,%0\ncmovc %3,%1" : "+m" (inRef), "+r"(y) : "r" (bit), "r"(x));
    printf("in=%d\n",in);
}

Compiler: icc (ICC) 13.1.0 20130121

Related question: bit test and set (BTS) on a tbb atomic variable

Recommended answer

IIRC, first-gen Xeon Phi is based on P5 cores (Pentium and Pentium MMX). cmov wasn't introduced until P6 (aka Pentium Pro), so I think this error is expected.

Just let the compiler do its job by writing a normal ternary operator.
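As a rough sketch of what "a normal ternary operator" could look like for the non-atomic case: the helper name test_and_set_select and its parameters are made up for illustration and are not from the original post. It mirrors what the question's bts + cmovc pair computes, but leaves instruction selection to the compiler, so a target without cmov simply gets a branch or setcc instead.

```cpp
// Hypothetical helper (not from the original post): plain, NON-atomic
// test-and-set of one bit. Returns if_set when the bit was already set
// before the call, and if_clear otherwise.
int test_and_set_select(unsigned &word, unsigned bit, int if_set, int if_clear) {
    const unsigned mask = 1u << bit;          // assumes bit < 32
    const bool was_set = (word & mask) != 0;  // the value BTS would leave in CF
    word |= mask;                             // set the bit
    return was_set ? if_set : if_clear;       // compiler chooses cmov, setcc, or a branch
}
```

With the question's values (word = 5, bit = 1, if_set = 0, if_clear = 1), the first call leaves word at 7 and returns 1, because bit 1 was clear. This version is not atomic; it only shows that the selection logic does not need hand-written cmov.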

Second, cmov is a far worse choice for this than setc, since you want to produce a 0 or 1 based on the carry flag. See my asm code below.

Also note that bts with a memory operand is super-slow, so you don't want the compiler to generate that code anyway, especially on a CPU that decodes x86 instructions into uops (like a modern Xeon). According to http://agner.org/optimize/, bts m, r is much slower than bts m, i even on P5, so don't do that.

Just ask the compiler to keep the variable in in a register (an "r" constraint instead of "m"), or better yet, just don't use inline asm for this.

Since the OP apparently wants this to work atomically, the best solution is to use C++11's std::atomic::fetch_or and leave it up to the compiler to generate lock bts.

std::atomic_flag has a test_and_set function, but IDK if there's a way to pack them tightly. Maybe as bitfields in a struct? Unlikely, though. I also don't see atomic operations for std::bitset.
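One way to get tightly packed atomic flags without std::atomic_flag is to keep the bits in an array of std::atomic<unsigned> words and use fetch_or on the word that holds the bit. The sketch below is only an illustration of that idea; the class name AtomicBitmap and its members are hypothetical and not part of the standard library or the original answer.

```cpp
#include <atomic>
#include <cstddef>
#include <limits>

// Hypothetical fixed-size bitmap with an atomic per-bit test-and-set.
// Bits are packed into unsigned words; each word is a std::atomic<unsigned>.
template <std::size_t NBits>
class AtomicBitmap {
    static constexpr std::size_t kWordBits = std::numeric_limits<unsigned>::digits;
    static constexpr std::size_t kWords = (NBits + kWordBits - 1) / kWordBits;
    std::atomic<unsigned> words_[kWords];

public:
    AtomicBitmap() {
        // Explicitly clear every word; before C++20 a default-constructed
        // std::atomic does not hold a defined value.
        for (auto &w : words_) w.store(0u, std::memory_order_relaxed);
    }

    // Atomically sets bit i and returns whether it was already set.
    // Assumes i < NBits.
    bool test_and_set(std::size_t i) {
        const unsigned mask = 1u << (i % kWordBits);
        const unsigned old = words_[i / kWordBits].fetch_or(mask, std::memory_order_acq_rel);
        return (old & mask) != 0;
    }

    // Reads bit i without modifying it. Assumes i < NBits.
    bool test(std::size_t i) const {
        const unsigned mask = 1u << (i % kWordBits);
        return (words_[i / kWordBits].load(std::memory_order_acquire) & mask) != 0;
    }
};
```

The trade-off is that neighbouring bits share a word (and a cache line), so heavily contended flags may be better off spread out rather than packed.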

Unfortunately, current versions of gcc and clang don't generate lock bts from fetch_or, even when the much faster immediate-operand form is usable. I came up with the following (godbolt link):

#include <atomic>
#include <stdint.h>  // uint8_t, used by atomic_bts_asm2
#include <stdio.h>

// wastes instructions when the return value isn't used.
// gcc 6.0 has syntax for using flags as output operands

// IDK if lock BTS is better than lock cmpxchg.
// However, gcc doesn't use lock BTS even with -Os
int atomic_bts_asm(std::atomic<unsigned> *x, int bit) {
  int retval = 0;  // the compiler still provides a zeroed reg as input even if retval isn't used after the asm :/
  // Letting the compiler do the xor means we can use a m constraint, in case this is inlined where we're storing to already zeroed memory
  // It unfortunately doesn't help for overwriting a value that's already known to be 0 or 1.
  asm( // "xor      %[rv], %[rv]\n\t"
       "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"  // hope that the compiler zeroed with xor to avoid a partial-register stall
        : [x] "+m" (*x), [rv] "+rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}

// save an insn when retval isn't used, but still doesn't avoid the setc
// leads to the less-efficient setc/ movzbl sequence when the result is needed :/
int atomic_bts_asm2(std::atomic<unsigned> *x, int bit) {
  uint8_t retval;
  asm( "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"
        : [x] "+m" (*x), [rv] "=rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}


int atomic_bts(std::atomic<unsigned> *x, unsigned int bit) {
  // bit &= 31; // stops gcc from using shlx?
  unsigned bitmask = 1u << bit;  // unsigned literal: shifting a signed 1 into bit 31 would overflow int
  //int oldval = x->fetch_or(bitmask, std::memory_order_relaxed);

  int oldval = x->fetch_or(bitmask, std::memory_order_acq_rel);
  // acquire and release semantics are free on x86
  // Also, any atomic rmw needs a lock prefix, which is a full memory barrier (seq_cst) anyway.

  if (oldval & bitmask)
    return 1;
  else
    return 0;
}

As discussed in "Which is best way to set a register to zero in x86 assembly: xor, mov or and?" (http://stackoverflow.com/questions/33666617/which-is-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and), xor / set-flags / setc is the optimal sequence for all modern CPUs when the result is needed as a 0-or-1 value. I haven't actually considered P5 for that, but setcc is fast on P5, so it should be fine.

Of course, if you want to branch on this instead of storing it, the boundary between inline asm and C is an obstacle. Spending two instructions to store a 0 or 1, only to test/branch on it, would be pretty dumb.

gcc6's flag-operand syntax would certainly be worth looking into, if it's an option. (Probably not if you need a compiler that targets Intel MIC.)
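For reference, here is a minimal sketch of what the flag-output form might look like. The "=@ccc" constraint (carry flag as an output operand) and the __GCC_ASM_FLAG_OUTPUTS__ feature macro are documented GCC features from version 6 on; the function name atomic_bts_flag, the "Ir" constraint choice, the explicit btsl suffix, and the fetch_or fallback are my own assumptions and not from the original answer.

```cpp
#include <atomic>

// Hypothetical helper: atomically sets bit `bit` of *x and returns whether it
// was already set. Assumes bit < 32.
inline bool atomic_bts_flag(std::atomic<unsigned> *x, unsigned bit) {
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool was_set;
    asm volatile("lock btsl %[bit], %[x]"
                 : [x] "+m"(*x),
                   [cf] "=@ccc"(was_set)  // output is the carry flag; no setc needed
                 : [bit] "Ir"(bit)         // immediate 0..31 when known, else a register
                 : "memory");              // compiler barrier; lock is the hardware barrier
    return was_set;
#else
    // Portable fallback when flag outputs are unavailable (assumption: this
    // would be the path taken by compilers without the extension).
    const unsigned mask = 1u << bit;
    return (x->fetch_or(mask, std::memory_order_seq_cst) & mask) != 0;
#endif
}
```

Because the result is described to the compiler as a condition code rather than a register value, a caller that writes if (atomic_bts_flag(&word, 3)) can be compiled to a branch directly on the carry flag, without materializing a 0 or 1 first.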
