SSE instructions: which CPUs can do atomic 16B memory operations?


Problem description

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc. x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16-byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer: PDF document lookup, brute-force testing, math proof, or whatever other method you used.

This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html


Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>   // abort()

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%d", (i>>j)&1);

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The CPU I have in my notebook is a Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads and writes with a granularity of 8 bytes. The output (columns: mask, count seen by thread1, count seen by thread2) is:

0000    96905702      10512
0001           0          0
0010           0          0
0011          22      12924  Not a single memory access!
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100     3092557       1175  Not a single memory access!
1101           0          0
1110           0          0
1111        1719   99975389
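
A note on reading this table: each row label is the 4-bit value returned by `_mm_movemask_ps`, that is, the sign bits of the four 32-bit lanes of `x`. Because the two threads only ever store all-zero or all-ones vectors, any mask other than 0000 or 1111 means a load observed a mix of two stores. The sketch below shows one way to turn a mask into the tearing granularity it implies; the helper name `tear_granularity` is made up for illustration and is not part of the test program.

```c
/* Hypothetical helper, not part of the original test program.
 * Bit k of `mask` is the sign bit of 32-bit lane k of x.
 * Returns 16 if the value looks whole, 8 if the low and high
 * 8-byte halves come from different stores, and 4 if the split
 * falls inside an 8-byte half. */
static int tear_granularity(int mask) {
    mask &= 0xF;
    if (mask == 0x0 || mask == 0xF)
        return 16;   /* all four lanes agree */
    if (mask == 0x3 || mask == 0xC)
        return 8;    /* lanes 0-1 agree, lanes 2-3 agree, halves differ */
    return 4;        /* lanes within one 8-byte half disagree */
}
```

For the Core Duo output above, the only unexpected rows are 0011 and 1100, for which this returns 8, consistent with the 8-byte granularity described.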

Solution

The Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, which nowadays contains the specifications from the memory ordering white paper you mention, says in section 8.2.3.1, as you note yourself, that

The Intel-64 memory ordering model guarantees that, for each of the following 
memory-access instructions, the constituent memory operation appears to execute 
as a single memory access:

• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2
byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned
on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on
an 8 byte boundary.

Any locked instruction (either the XCHG instruction or another read-modify-write
 instruction with a LOCK prefix) appears to execute as an indivisible and 
uninterruptible sequence of load(s) followed by store(s) regardless of alignment.

Now, since the above list does NOT contain the same language for double quadword (16 bytes), it follows that the architecture does NOT guarantee that instructions which access 16 bytes of memory are atomic.
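
To make the rule concrete, here is a minimal sketch that encodes only the four cases in the quoted list. The function name `guaranteed_single_access` is hypothetical, and the function reflects nothing beyond the text quoted above; it says what the manual promises, not what a given chip happens to do.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: returns true when an access of `size` bytes at
 * address `addr` falls under one of the cases in the quoted list. */
static bool guaranteed_single_access(uintptr_t addr, size_t size) {
    switch (size) {
    case 1:
        return true;                /* single byte: always */
    case 2:
    case 4:
    case 8:
        return (addr % size) == 0;  /* naturally aligned word/dword/qword */
    default:
        return false;               /* includes aligned 16-byte SSE accesses */
    }
}
```

So `guaranteed_single_access(0x1000, 8)` is true, while `guaranteed_single_access(0x1000, 16)` is false even though the address is 16-byte aligned.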

That being said, the last paragraph does hint at a way out, namely the CMPXCHG16B instruction with the LOCK prefix. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
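
As a rough sketch of that route (assuming x86-64 and GCC or Clang; the names `has_cx16`, `u128`, `cas16`, `load16` and `store16` are mine, not from any library), you can test the CX16 bit and build 16-byte atomic loads and stores out of LOCK CMPXCHG16B:

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#include <stdint.h>

/* Nonzero if CPUID leaf 1 reports CX16 (ECX bit 13). */
static int has_cx16(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 13) & 1;
}

/* 16-byte value; CMPXCHG16B requires 16-byte alignment. */
typedef struct { uint64_t lo, hi; } __attribute__((aligned(16))) u128;

/* LOCK CMPXCHG16B: if *p == *expected, store `desired` and return 1;
 * otherwise copy the current *p into *expected and return 0. */
static int cas16(volatile u128 *p, u128 *expected, u128 desired) {
    unsigned char ok;
    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"
        "setz %0"
        : "=q"(ok), "+m"(*p), "+a"(expected->lo), "+d"(expected->hi)
        : "b"(desired.lo), "c"(desired.hi)
        : "cc", "memory");
    return ok;
}

/* Atomic 16-byte load: a CAS with expected == desired == 0 either
 * rewrites zero with zero or fails and hands back the current value.
 * The location must be writable, since the instruction always writes. */
static u128 load16(volatile u128 *p) {
    u128 v = {0, 0};
    cas16(p, &v, v);
    return v;
}

/* Atomic 16-byte store: retry the CAS until it succeeds. */
static void store16(volatile u128 *p, u128 v) {
    u128 old = {0, 0};
    while (!cas16(p, &old, v))
        ;  /* `old` now holds the value just observed; try again */
}
```

Call `has_cx16()` once at startup and fall back to a lock if it returns 0. Note that this is considerably slower than a plain MOVAPS, and the load cannot be used on read-only memory.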

In the corresponding AMD document, the AMD64 Architecture Programmer's Manual Volume 2: System Programming, I can't find similarly clear language.

EDIT: Test program results

(Test program modified to increase #iterations by a factor of 10)

On a Xeon X3450 (x86-64):

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

On a Xeon 5150 (32-bit):

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

On an Opteron 2435 (x86-64):

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

Does this mean that Intel and/or AMD guarantee that 16-byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, so one cannot know whether 16-byte memory accesses really are atomic on these particular processors or whether the test program merely fails to trigger a torn access for one reason or another. Relying on it is therefore dangerous.

EDIT 2: How to make the test program fail

Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009

So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

EDIT 3: ASM for the thread functions, on request of "GJ."

Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:


.globl thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
.globl thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc

and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:


.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS. That doesn't matter, though: with -march=core2 GCC uses MOVDQA for storing to x, and I can still reproduce the failures.
