Java compare and swap semantics and performance


Problem description


What are the semantics of compare and swap in Java? Namely, does the compare-and-swap method of an AtomicInteger guarantee ordered access between threads only to the particular memory location of the atomic integer instance, or does it guarantee ordered access to all locations in memory, i.e. does it act as if it were a volatile (a memory fence)?

From the docs:

  • weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of any variables other than the target of the weakCompareAndSet.
  • compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.

It's apparent from the API documentation that compareAndSet acts as if it were a volatile variable. However, weakCompareAndSet is supposed to just change its specific memory location. Thus, if that memory location is exclusive to the cache of a single processor, weakCompareAndSet is supposed to be much faster than the regular compareAndSet.
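To make the documented difference concrete, here is a minimal Java sketch of what "the memory effects of both reading and writing volatile variables" buys you. The class name `CasPublish` and the fields `payload` and `flag` are made up for illustration. The writer stores to a plain field and then performs a `compareAndSet`; the reader spins on `get()` and only then reads the plain field.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class; not part of the original question.
final class CasPublish {
    static int payload = 0; // plain, non-volatile field
    static final AtomicInteger flag = new AtomicInteger(0);

    // Returns the payload value observed by the reader thread.
    static int run() {
        payload = 0;
        flag.set(0);
        final int[] seen = new int[1];
        Thread writer = new Thread(() -> {
            payload = 42;             // ordinary write
            flag.compareAndSet(0, 1); // documented to have volatile-write memory effects
        });
        Thread reader = new Thread(() -> {
            while (flag.get() != 1) { // volatile read of the flag
                Thread.yield();
            }
            seen[0] = payload;        // ordered after the writer's compareAndSet
        });
        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

If `compareAndSet` were replaced by `weakCompareAndSet`, the documentation quoted above would no longer promise anything about `payload`, even though the reader might still happen to see the new value on a given JVM and CPU.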

I'm asking this because I've benchmarked the following methods by running threadnum different threads, varying threadnum from 1 to 8, and having totalwork=1e9 (the code is written in Scala, a statically compiled JVM language, but both its meaning and bytecode translation are isomorphic to that of Java in this case, so these short snippets should be clear):

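import java.util.concurrent.atomic.AtomicInteger

// Counter shared by all threads.
val atomic_cnt = new AtomicInteger(0)
// One private counter per thread.
val atomic_tlocal_cnt = new java.lang.ThreadLocal[AtomicInteger] {
  override def initialValue = new AtomicInteger(0)
}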

def loop_atomic_tlocal_cas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_tlocal_cnt.get
  while (i < until) {
    i += 1
    acnt.compareAndSet(i - 1, i)
  }
  acnt.get + i
}

def loop_atomic_weakcas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_cnt
  while (i < until) {
    i += 1
    acnt.weakCompareAndSet(i - 1, i)
  }
  acnt.get + i
}

def loop_atomic_tlocal_weakcas = {
  var i = 0
  val until = totalwork / threadnum
  val acnt = atomic_tlocal_cnt.get
  while (i < until) {
    i += 1
    acnt.weakCompareAndSet(i - 1, i)
  }
  acnt.get + i
}

The benchmarks were run on an AMD machine with four dual-core 2.8 GHz processors and on a 2.67 GHz quad-core i7. The JVM is the Sun HotSpot Server JVM 1.6. The results show no performance difference.
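The question does not show the harness that drives these loops. The following Java sketch shows one way such a run could be set up, assuming one private counter per thread as in the thread-local variants. The names `CasBench` and `runLoops` are hypothetical, and the timing is deliberately naive: there is no warm-up and no protection against JIT optimisations, so a dedicated harness such as JMH would be a safer choice for real measurements.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness; the asker's actual driver code is not shown in the question.
final class CasBench {
    // Splits totalWork CAS attempts across threadNum threads, each on its own counter.
    // Returns the sum of the final counter values.
    static long runLoops(int threadNum, int totalWork, boolean weak) {
        final long[] finals = new long[threadNum];
        Thread[] threads = new Thread[threadNum];
        for (int t = 0; t < threadNum; t++) {
            final int idx = t;
            threads[t] = new Thread(() -> {
                AtomicInteger acnt = new AtomicInteger(0);
                int until = totalWork / threadNum;
                int i = 0;
                while (i < until) {
                    i += 1;
                    // As in the question, a failed CAS is not retried.
                    if (weak) {
                        acnt.weakCompareAndSet(i - 1, i);
                    } else {
                        acnt.compareAndSet(i - 1, i);
                    }
                }
                finals[idx] = acnt.get();
            });
        }
        long start = System.nanoTime();
        for (Thread th : threads) {
            th.start();
        }
        try {
            for (Thread th : threads) {
                th.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("threads=" + threadNum + " weak=" + weak + " ms=" + elapsedMs);
        long sum = 0;
        for (long f : finals) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        runLoops(4, 1_000_000, false);
        runLoops(4, 1_000_000, true);
    }
}
```

With the strong `compareAndSet` on a private counter every attempt succeeds, so the returned sum equals the work actually performed. With `weakCompareAndSet` a single spurious failure would make every later attempt in that thread fail too, because the loop never retries.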

Specs: AMD 8220 4x dual-core @ 2.8 GHz

Test name: loop_atomic_tlocal_cas

  • Thread num.: 1

Run times: (showing last 3) 7504.562 7502.817 7504.626 (avg = 7415.637 min = 7147.628 max = 7504.886 )

  • Thread num.: 2

Run times: (showing last 3) 3751.553 3752.589 3751.519 (avg = 3713.5513 min = 3574.708 max = 3752.949 )

  • Thread num.: 4

Run times: (showing last 3) 1890.055 1889.813 1890.047 (avg = 2065.7207 min = 1804.652 max = 3755.852 )

  • Thread num.: 8

Run times: (showing last 3) 960.12 989.453 970.842 (avg = 1058.8776 min = 940.492 max = 1893.127 )


Test name: loop_atomic_weakcas

  • Thread num.: 1

Run times: (showing last 3) 7325.425 7057.03 7325.407 (avg = 7231.8682 min = 7057.03 max = 7325.45 )

  • Thread num.: 2

Run times: (showing last 3) 3663.21 3665.838 3533.406 (avg = 3607.2149 min = 3529.177 max = 3665.838 )

  • Thread num.: 4

Run times: (showing last 3) 3664.163 1831.979 1835.07 (avg = 2014.2086 min = 1797.997 max = 3664.163 )

  • Thread num.: 8

Run times: (showing last 3) 940.504 928.467 921.376 (avg = 943.665 min = 919.985 max = 997.681 )


Test name: loop_atomic_tlocal_weakcas

  • Thread num.: 1

Run times: (showing last 3) 7502.876 7502.857 7502.933 (avg = 7414.8132 min = 7145.869 max = 7502.933 )

  • Thread num.: 2

Run times: (showing last 3) 3752.623 3751.53 3752.434 (avg = 3710.1782 min = 3574.398 max = 3752.623 )

  • Thread num.: 4

Run times: (showing last 3) 1876.723 1881.069 1876.538 (avg = 4110.4221 min = 1804.62 max = 12467.351 )

  • Thread num.: 8

Run times: (showing last 3) 959.329 1010.53 969.767 (avg = 1072.8444 min = 959.329 max = 1880.049 )

Specs: Intel i7 quad-core @ 2.67 GHz

Test name: loop_atomic_tlocal_cas

  • Thread num.: 1

Run times: (showing last 3) 8138.3175 8130.0044 8130.1535 (avg = 8119.2888 min = 8049.6497 max = 8150.1950 )

  • Thread num.: 2

Run times: (showing last 3) 4067.7399 4067.5403 4068.3747 (avg = 4059.6344 min = 4026.2739 max = 4068.5455 )

  • Thread num.: 4

Run times: (showing last 3) 2033.4389 2033.2695 2033.2918 (avg = 2030.5825 min = 2017.6880 max = 2035.0352 )


Test name: loop_atomic_weakcas

  • Thread num.: 1

Run times: (showing last 3) 8130.5620 8129.9963 8132.3382 (avg = 8114.0052 min = 8042.0742 max = 8132.8542 )

  • Thread num.: 2

Run times: (showing last 3) 4066.9559 4067.0414 4067.2080 (avg = 4086.0608 min = 4023.6822 max = 4335.1791 )

  • Thread num.: 4

Run times: (showing last 3) 2034.6084 2169.8127 2034.5625 (avg = 2047.7025 min = 2032.8131 max = 2169.8127 )


Test name: loop_atomic_tlocal_weakcas

  • Thread num.: 1

Run times: (showing last 3) 8132.5267 8132.0299 8132.2415 (avg = 8114.9328 min = 8043.3674 max = 8134.0418 )

  • Thread num.: 2

Run times: (showing last 3) 4066.5924 4066.5797 4066.6519 (avg = 4059.1911 min = 4025.0703 max = 4066.8547 )

  • Thread num.: 4

Run times: (showing last 3) 2033.2614 2035.5754 2036.9110 (avg = 2033.2958 min = 2023.5082 max = 2038.8750 )


While it's possible that thread locals in the example above end up in the same cache lines, it seems to me that there is no observable performance difference between regular CAS and its weak version.

This could mean that, in fact, a weak compare and swap acts as a fully fledged memory fence, i.e. acts as if it were a volatile variable.

Question: Is this observation correct? Also, is there a known architecture or Java distribution for which a weak compare and set is actually faster? If not, what is the advantage of using a weak CAS in the first place?

Solution

A weak compare and swap could act as a full volatile variable, depending on the implementation of the JVM, sure. In fact, I wouldn't be surprised if on certain architectures it is not possible to implement a weak CAS in a notably more performant way than the normal CAS. On these architectures, it may well be the case that weak CASes are implemented exactly the same as a full CAS. Or it might simply be that your JVM has not had much optimisation put into making weak CASes particularly fast, so the current implementation just invokes a full CAS because it's quick to implement, and a future version will refine this.

The API documentation simply says that a weak CAS does not establish a happens-before relationship, so there is no guarantee that the modification it causes is visible in other threads. All you get in this case is the guarantee that the compare-and-set operation is atomic, but with no guarantees about the visibility of the (potentially) new value. That's not the same as guaranteeing that it won't be seen, so your tests are consistent with this.
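As a rough illustration of where atomicity alone is enough, here is a Java sketch of a counter incremented through a retry loop around `weakCompareAndSet`. The names `WeakCasCounter`, `increment` and `countTo` are invented for this example. The Javadoc for `weakCompareAndSet` also allows it to fail spuriously, which is why it is only appropriate inside a loop that retries, as here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; not taken from the original answer.
final class WeakCasCounter {
    // Retries until the weak CAS succeeds, so spurious failures are harmless.
    static int increment(AtomicInteger counter) {
        while (true) {
            int current = counter.get();
            if (counter.weakCompareAndSet(current, current + 1)) {
                return current + 1;
            }
        }
    }

    // Starts `threads` threads that each increment a shared counter `perThread` times.
    static int countTo(int threads, int perThread) {
        final AtomicInteger counter = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    increment(counter);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join(); // join makes each worker's updates visible to the caller
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countTo(4, 10_000));
    }
}
```

No increment is lost because each successful CAS is atomic, and the final read is safe because it happens after `join`. On Java 9 and later `weakCompareAndSet` is deprecated in favour of `weakCompareAndSetPlain`, so the call compiles with a deprecation warning there.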

In general, try to avoid drawing any conclusions about concurrency-related behaviour through experimentation. There are so many variables to take into account that if you don't follow what the JLS guarantees to be correct, then your program could break at any time (perhaps on a different architecture, perhaps under more aggressive optimisation that's prompted by a slight change in the layout of your code, perhaps under future builds of the JVM that don't exist yet, etc.). There's never a reason to assume you can get away with something that's stated not to be guaranteed just because experiments show that "it works".
