How is LongAccumulator implemented, so that it is more efficient?



I understand that the new Java (8) has introduced new synchronization tools such as LongAccumulator (under the atomic package).

In the documentation it says that LongAccumulator is more efficient when the variable is updated frequently from several threads.

I wonder how it is implemented to be more efficient.

Solution

That's a very good question, because it shows a very important characteristic of concurrent programming with shared memory. Before going into details, I have to take a step back. Take a look at the following class:

class Accumulator {
    private final AtomicLong value = new AtomicLong(0);
    public void accumulate(long value) {
        this.value.addAndGet(value);
    }
    public long get() {
        return this.value.get();
    }
}

If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
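The effect is easy to observe with a small experiment (a sketch; the absolute numbers depend on your hardware, and the class names here are made up for the demo). The same total amount of work is split across one thread and then across four, and the contended run is typically far slower despite producing the same total:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContentionDemo {
    static class Accumulator {
        private final AtomicLong value = new AtomicLong(0);
        void accumulate(long v) { value.addAndGet(v); }
        long get() { return value.get(); }
    }

    // Runs `threads` threads, each adding 1 `perThread` times,
    // and reports the total and the elapsed wall-clock time.
    static long run(int threads, long perThread) throws InterruptedException {
        Accumulator acc = new Accumulator();
        Thread[] ts = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (long j = 0; j < perThread; j++) acc.accumulate(1);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(threads + " thread(s): total=" + acc.get()
                + ", " + elapsedMs + " ms");
        return acc.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long t1 = run(1, 4_000_000L);  // all work on one thread
        long t4 = run(4, 1_000_000L);  // same work, four contending threads
        System.out.println(t1 == t4 ? "totals match" : "totals differ");
    }
}
```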

You have to take a look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically structured into cache lines of 64 bytes. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. If it does not have exclusive access yet, that is really expensive, due to the required coordination with all other cores.

There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:

class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };
    public void accumulate(long value) {
        int index = getMagicValue();
        this.values[index % values.length].addAndGet(value);
    }
    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }
}

At first glance, this class seems more expensive due to the additional operations. However, it might be several times faster than the first class, because there is a higher probability that the executing core already has exclusive access to the required cache line.

To make this really fast, you have to consider a few more things:

  • The different atomic counters should be located on different cache lines. Otherwise you replace one problem with another, namely false sharing. In Java you can use a long[8 * 4] for that purpose and only use the indexes 0, 8, 16 and 24.
  • The number of counters has to be chosen wisely. If there are too few counters, there are still too many cache-line switches. If there are too many, you waste space in the L1 caches.
  • The method getMagicValue should return a value with an affinity to the core id.
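Combining the points above gives something like the following sketch. It is not the real LongAccumulator implementation; getStripe is a hypothetical stand-in for getMagicValue that derives a stripe from the thread id, because the core id is not accessible from Java, and an AtomicLongArray is used (instead of a plain long[8 * 4]) so that the additions stay atomic:

```java
import java.util.concurrent.atomic.AtomicLongArray;

class PaddedAccumulator {
    // 4 counters, spaced 8 longs (64 bytes) apart so that each one
    // lands on its own cache line: indexes 0, 8, 16 and 24.
    private static final int STRIPES = 4;
    private static final int SPACING = 8;
    private final AtomicLongArray values = new AtomicLongArray(STRIPES * SPACING);

    // Stand-in for getMagicValue: a stripe derived from the thread id.
    // An affinity to the core id would be better, but Java cannot query it.
    private static int getStripe() {
        return (int) (Thread.currentThread().getId() % STRIPES);
    }

    public void accumulate(long value) {
        values.addAndGet(getStripe() * SPACING, value);
    }

    public long get() {
        long result = 0;
        for (int i = 0; i < STRIPES; i++) {
            result += values.get(i * SPACING);
        }
        return result;
    }
}
```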

To sum up, LongAccumulator is more efficient for some use cases, because it uses redundant memory for frequent write operations, in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to combine the partial counters into a consistent result.
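For completeness, here is how the real class is used. Passing Long::sum as the accumulator function and 0 as the identity gives plain summation; internally the class keeps striped cells, following the same idea as the hand-written version above:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.stream.LongStream;

public class LongAccumulatorDemo {
    public static void main(String[] args) {
        LongAccumulator sum = new LongAccumulator(Long::sum, 0L);
        // Many threads may call accumulate concurrently; contention is
        // spread over internal cells instead of a single hot cache line.
        LongStream.rangeClosed(1, 100).parallel().forEach(sum::accumulate);
        System.out.println(sum.get()); // prints 5050
    }
}
```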
