CAS vs synchronized performance


Problem description

I've had this question for quite a while now, trying to read lots of resources and understand what is going on, but I've still failed to get a good understanding of why things are the way they are.

Simply put, I'm trying to test how a CAS would perform vs. synchronized in contended and uncontended environments. I've put up this JMH test:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SandBox {

    Object lock = new Object();

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(SandBox.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @State(Scope.Thread)
    public static class Holder {

        private long number;

        private AtomicLong atomicLong;

        @Setup
        public void setUp() {
            number = ThreadLocalRandom.current().nextLong();
            atomicLong = new AtomicLong(number);
        }
    }

    @Fork(1)
    @Benchmark
    public long sync(Holder holder) {
        long n = holder.number;
        synchronized (lock) {
            n = n * 123;
        }

        return n;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong cas(Holder holder) {
        AtomicLong al = holder.atomicLong;
        al.updateAndGet(x -> x * 123);
        return al;
    }

    private Object anotherLock = new Object();

    private long anotherNumber = ThreadLocalRandom.current().nextLong();

    private AtomicLong anotherAl = new AtomicLong(anotherNumber);

    @Fork(1)
    @Benchmark
    public long syncShared() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }

        return anotherNumber;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong casShared() {
        anotherAl.updateAndGet(x -> x * 123);
        return anotherAl;
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    public long syncSharedNonBiased() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }

        return anotherNumber;
    }

}

Results:

Benchmark                                           Mode  Cnt     Score      Error  Units
spinLockVsSynchronized.SandBox.cas                  avgt    5   212.922 ±   18.011  ns/op
spinLockVsSynchronized.SandBox.casShared            avgt    5  4106.764 ± 1233.108  ns/op
spinLockVsSynchronized.SandBox.sync                 avgt    5  2869.664 ±  231.482  ns/op
spinLockVsSynchronized.SandBox.syncShared           avgt    5  2414.177 ±   85.022  ns/op
spinLockVsSynchronized.SandBox.syncSharedNonBiased  avgt    5  2696.102 ±  279.734  ns/op

In the non-shared case CAS is by far faster, which I would expect. But in the shared case, things are the other way around, and this I can't understand. I don't think this is related to biased locking, as that would happen after a thread holds the lock for 5 seconds (AFAIK), and this does not happen here; the test is just proof of that.

I honestly hope it's just my tests that are wrong, and that someone with JMH expertise will come along and point me to the wrong setup here.

Answer

The main misconception is the assumption that you are comparing "CAS vs. synchronized". Given how sophisticated JVMs implement synchronized, you are actually comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.

Similar to Lock, the internal information for an object monitor basically consists of an int status telling whether it is owned and how often it is nested, a reference to the current owner thread, and a queue of threads waiting to acquire it. The expensive aspect is the waiting queue: putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor are operations that can take significant time.

However, in the uncontended case, the waiting queue is, of course, not involved. Acquiring the monitor consists of a single CAS to change the status from "unowned" (usually zero) to "owned, acquired once" (to guess a typical value). If successful, the thread can proceed with the critical action, followed by a release, which implies just writing the "unowned" state with the necessary memory visibility and waking up another blocked thread, if there is one.
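The uncontended fast path described above can be sketched as a toy monitor whose entire state is a single int (the class and method names here are illustrative, not anything from the actual JVM): acquisition is one CAS from 0 ("unowned") to 1 ("owned"), and release is a plain write back to 0.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy monitor: status 0 = "unowned", 1 = "owned, acquired once".
class TinyMonitor {
    private final AtomicInteger status = new AtomicInteger(0);

    boolean tryAcquire() {
        // The uncontended fast path: a single CAS.
        return status.compareAndSet(0, 1);
    }

    void release() {
        // A plain (volatile) write back to "unowned"; a real monitor
        // would also wake up a queued thread here, if there is one.
        status.set(0);
    }
}

public class TinyMonitorDemo {
    public static void main(String[] args) {
        TinyMonitor m = new TinyMonitor();
        System.out.println(m.tryAcquire()); // true: fast path succeeds
        System.out.println(m.tryAcquire()); // false: already owned
        m.release();
        System.out.println(m.tryAcquire()); // true: reacquired after release
    }
}
```

Note how the fast path costs exactly one CAS, which is why an uncontended synchronized block can get close to AtomicLong in cost.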

Since the wait queue is the significantly more expensive thing, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released during the spinning phase already. Note that synchronized is "unfair", allowing a spinning thread to proceed immediately, even if there are already enqueued threads that have been waiting far longer.

If you compare the fundamental operations performed by synchronized(lock){ n = n * 123; } when no queuing is involved and by al.updateAndGet(x -> x * 123);, you'll notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication on contention, while for the synchronized approach there is a risk of being put into the queue if no progress has been made during spinning.

But synchronized allows lock coarsening for code repeatedly synchronizing on the same object, which might be relevant for a benchmark loop calling the syncShared method. Unless there's also a way to fuse multiple CAS updates of an AtomicLong, this can give synchronized a dramatic advantage. (See also this article covering several aspects discussed above.)

Note that due to the "unfair" nature of synchronized, creating far more threads than CPU cores doesn't have to be a problem. In the best case, "number of threads minus number of cores" threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core can't reduce the performance of the AtomicLong update, as they can neither invalidate the current value for other threads nor make failed CAS attempts.

In either case, when CASing on the member variable of an unshared object or when performing synchronized on an unshared object, the JVM may detect the thread-local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.

The bottom line is that there is no easy decision between atomic updates and synchronized blocks. Things get far more interesting with more expensive operations, which may raise the likelihood of threads getting enqueued in the contended case of synchronized, and which in turn may make it acceptable that the operation has to be repeated in the contended case of an atomic update.
