Is volatile expensive?

Question

After reading The JSR-133 Cookbook for Compiler Writers about the implementation of volatile, especially the section "Interactions with Atomic Instructions", I assume that reading a volatile variable without updating it needs a LoadLoad or a LoadStore barrier. Further down the page I see that LoadLoad and LoadStore are effectively no-ops on x86 CPUs. Does this mean that volatile reads can be done on x86 without an explicit cache invalidation, and are as fast as a normal variable read (disregarding the reordering constraints of volatile)?
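For context, the ordering guarantee those barriers protect is the classic "publish through a volatile flag" pattern. The sketch below is a hypothetical illustration (the class and method names are made up, not from the question): the reader's volatile load of `ready` must not be reordered with its later plain load of `data`, which is exactly what LoadLoad/LoadStore after a volatile load enforce.

```java
// Hypothetical example class illustrating why a volatile read needs
// LoadLoad/LoadStore ordering; names are illustrative only.
public class Publication {
    private int data;                 // plain field
    private volatile boolean ready;   // volatile flag used to publish 'data'

    // Writer: the plain store to 'data' cannot move after the volatile store to 'ready'.
    void publish(int value) {
        data = value;
        ready = true;
    }

    // Reader: the volatile load of 'ready' cannot be reordered with the
    // subsequent plain load of 'data'.
    Integer tryRead() {
        if (ready) {
            return data;   // sees the value written before 'ready = true'
        }
        return null;       // not published yet
    }

    public static void main(String[] args) throws InterruptedException {
        Publication p = new Publication();
        Thread writer = new Thread(() -> p.publish(42));
        writer.start();
        Integer seen;
        while ((seen = p.tryRead()) == null) {
            Thread.yield();   // spin politely until the flag is set
        }
        writer.join();
        System.out.println(seen);   // prints 42
    }
}
```

On x86 the hardware already keeps loads in order with respect to other loads and later stores, which is why the Cookbook lists these two barriers as no-ops there; the JIT still has to respect the ordering itself.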

I suspect I'm not understanding this correctly. Could someone enlighten me?

EDIT: I wonder whether multi-processor environments behave differently. On single-CPU systems the CPU might look at its own thread caches, as John V. states, but on multi-CPU systems there must be some configuration telling the CPUs that this is not enough and main memory has to be hit, making volatile slower on multi-CPU systems, right?

PS: While learning more about this I stumbled upon the following great articles, and since this question may be interesting to others, I'll share my links here:

Solution

On Intel, an uncontended volatile read is quite cheap. Consider the following simple case:

public static long l;

public static void run() {        
    if (l == -1)
        System.exit(-1);

    if (l == -2)
        System.exit(-1);
}
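To reproduce listings like the ones below, the method has to be hot enough for the JIT to compile it. The harness here is a hypothetical reconstruction: the names `Test2` and `run2` are taken from the assembly comments, while the loop count is an assumption. HotSpot only prints assembly with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` and the hsdis disassembler plugin installed.

```java
// Hypothetical harness; class/method names mirror the assembly listing,
// the iteration count is an assumption chosen to trigger JIT compilation.
// Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Test2
public class Test2 {
    public static long l;   // add 'volatile' here to compare the two listings

    public static void run() {
        if (l == -1)
            System.exit(-1);
        if (l == -2)
            System.exit(-1);
    }

    // run() gets inlined into run2(), matching "Test2::run@0" inside run2's code.
    public static void run2() {
        run();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            run2();   // enough calls for the JIT to compile run2
        }
        System.out.println("done");
    }
}
```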

Using Java 7's ability to print assembly code (HotSpot's -XX:+PrintAssembly), the run method looks something like this:

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb396ce80: mov    %eax,-0x3000(%esp)
0xb396ce87: push   %ebp
0xb396ce88: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 33)
0xb396ce8e: mov    $0xffffffff,%ecx
0xb396ce93: mov    $0xffffffff,%ebx
0xb396ce98: mov    $0x6fa2b2f0,%esi   ;   {oop('Test2')}
0xb396ce9d: mov    0x150(%esi),%ebp
0xb396cea3: mov    0x154(%esi),%edi   ;*getstatic l
                                    ; - Test2::run@0 (line 33)
0xb396cea9: cmp    %ecx,%ebp
0xb396ceab: jne    0xb396ceaf
0xb396cead: cmp    %ebx,%edi
0xb396ceaf: je     0xb396cece         ;*getstatic l
                                    ; - Test2::run@14 (line 37)
0xb396ceb1: mov    $0xfffffffe,%ecx
0xb396ceb6: mov    $0xffffffff,%ebx
0xb396cebb: cmp    %ecx,%ebp
0xb396cebd: jne    0xb396cec1
0xb396cebf: cmp    %ebx,%edi
0xb396cec1: je     0xb396ceeb         ;*return
                                    ; - Test2::run@28 (line 40)
0xb396cec3: add    $0x8,%esp
0xb396cec6: pop    %ebp
0xb396cec7: test   %eax,0xb7732000    ;   {poll_return}
;... lines removed

If you look at the two references to getstatic, the first involves a load from memory; the second skips the load because the value is reused from the registers it was already loaded into (a long is 64 bits, and on my 32-bit laptop it occupies two registers, %ebp and %edi).
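Keeping a plain field in a register is what makes the well-known "stop flag" bug possible. In the hypothetical sketch below (names are illustrative, not from the answer), the worker loops on a flag; if `running` were a plain field, the JIT would be allowed to hoist the load out of the loop exactly as in the first listing, and the loop might never see the update. Declaring it volatile forces a fresh load on each iteration.

```java
// Hypothetical example: a worker spins until another thread clears a volatile flag.
public class StopFlag {
    private volatile boolean running = true;   // volatile: re-read on every iteration
    private long iterations;                   // only written by the worker thread

    void work() {
        while (running) {
            iterations++;
        }
    }

    void stop() {
        running = false;   // volatile write, eventually visible to the worker
    }

    long iterations() {
        return iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        StopFlag s = new StopFlag();
        Thread worker = new Thread(s::work);
        worker.start();
        Thread.sleep(100);
        s.stop();
        worker.join();   // join() also makes 'iterations' safely visible here
        System.out.println("worker stopped after " + s.iterations() + " iterations");
    }
}
```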

If we make the l variable volatile, the resulting assembly is different:

# {method} 'run2' '()V' in 'Test2'
#           [sp+0x10]  (sp of caller)
0xb3ab9340: mov    %eax,-0x3000(%esp)
0xb3ab9347: push   %ebp
0xb3ab9348: sub    $0x8,%esp          ;*synchronization entry
                                    ; - Test2::run2@-1 (line 32)
0xb3ab934e: mov    $0xffffffff,%ecx
0xb3ab9353: mov    $0xffffffff,%ebx
0xb3ab9358: mov    $0x150,%ebp
0xb3ab935d: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab9365: movd   %xmm0,%eax
0xb3ab9369: psrlq  $0x20,%xmm0
0xb3ab936e: movd   %xmm0,%edx         ;*getstatic l
                                    ; - Test2::run@0 (line 32)
0xb3ab9372: cmp    %ecx,%eax
0xb3ab9374: jne    0xb3ab9378
0xb3ab9376: cmp    %ebx,%edx
0xb3ab9378: je     0xb3ab93ac
0xb3ab937a: mov    $0xfffffffe,%ecx
0xb3ab937f: mov    $0xffffffff,%ebx
0xb3ab9384: movsd  0x6fb7b2f0(%ebp),%xmm0  ;   {oop('Test2')}
0xb3ab938c: movd   %xmm0,%ebp
0xb3ab9390: psrlq  $0x20,%xmm0
0xb3ab9395: movd   %xmm0,%edi         ;*getstatic l
                                    ; - Test2::run@14 (line 36)
0xb3ab9399: cmp    %ecx,%ebp
0xb3ab939b: jne    0xb3ab939f
0xb3ab939d: cmp    %ebx,%edi
0xb3ab939f: je     0xb3ab93ba         ;*return
;... lines removed

In this case both getstatic references to the variable l involve a load from memory, i.e. the value cannot be kept in a register across multiple volatile reads. To make the read atomic, the value is loaded from memory into an XMM (SSE) register with movsd 0x6fb7b2f0(%ebp),%xmm0, making the read a single instruction (the previous example showed that a 64-bit value would normally require two 32-bit reads on a 32-bit system).
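The single-instruction load is needed because the Java Language Specification (§17.7) allows a non-volatile long or double to be written as two separate 32-bit halves, while volatile long and double accesses must be atomic. The hypothetical check below (class name and read count are illustrative) has one thread flip a volatile long between 0 and -1 (all bits set) while another counts reads that are neither value; for a volatile field that count must be zero.

```java
// Hypothetical check of 64-bit atomicity for a volatile long.
// A "torn" read would combine the high and low halves of different writes.
public class TornRead {
    static volatile long value = 0L;

    // Returns how many reads observed something other than 0 or -1.
    static long countTornReads(int reads) throws InterruptedException {
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                value = 0L;
                value = -1L;   // all 64 bits set
            }
        });
        writer.start();
        long torn = 0;
        for (int i = 0; i < reads; i++) {
            long v = value;   // single atomic 64-bit volatile read
            if (v != 0L && v != -1L) {
                torn++;
            }
        }
        writer.interrupt();
        writer.join();
        return torn;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("torn reads: " + countTornReads(10_000_000));   // prints "torn reads: 0"
    }
}
```

Removing `volatile` does not guarantee torn reads will appear (64-bit JVMs normally write longs atomically anyway); it only removes the guarantee that they cannot.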

So the overall cost of a volatile read is roughly equivalent to a memory load and can be as cheap as an L1 cache access. However, if another core is writing to the volatile variable, the cache line will be invalidated, requiring a main-memory or perhaps an L3 cache access. The actual cost depends heavily on the CPU architecture; even Intel and AMD use different cache-coherency protocols.
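A rough way to see the difference between the uncontended and contended cases is to time a loop of volatile reads with and without another thread writing the same field. This is a naive sketch under stated assumptions (hypothetical class, arbitrary read count, no control over which cores the threads land on); for trustworthy numbers a benchmark harness such as JMH is the better tool, and results will vary widely between CPUs.

```java
// Naive timing sketch: volatile reads of an uncontended field versus a field
// another thread keeps writing (forcing cache-line invalidations on the reader).
public class VolatileReadCost {
    static volatile long shared;
    static long sink;   // consumes the sum so the JIT cannot discard the loop

    static long timeReads(int reads) {
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reads; i++) {
            sum += shared;   // each iteration performs a fresh volatile load
        }
        long elapsed = System.nanoTime() - start;
        sink += sum;
        return elapsed;
    }

    static long timeReadsWithWriter(int reads) throws InterruptedException {
        Thread writer = new Thread(() -> {
            long i = 0;
            while (!Thread.currentThread().isInterrupted()) {
                shared = i++;   // continuous writes, ideally from another core
            }
        });
        writer.start();
        try {
            return timeReads(reads);
        } finally {
            writer.interrupt();
            writer.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int reads = 50_000_000;
        timeReads(reads);   // warm-up so the loop is JIT-compiled first
        System.out.printf("uncontended: %d ms%n", timeReads(reads) / 1_000_000);
        System.out.printf("with writer: %d ms%n", timeReadsWithWriter(reads) / 1_000_000);
    }
}
```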
