lock cmpxchg fails to execute threads in core order


Problem description

The following 64-bit NASM code uses lock cmpxchg to take each core in core order: each core waits for its turn, executes some code, then resets the turn variable with xchg so the next core can execute the code. Each core's number is stored in rbx -- the four cores are numbered 0, 8, 16 and 24. The variable [spin_lock_core] starts at zero, and when each core is finished it advances the core number by 8 on the final line, xchg [spin_lock_core],rax.

Spin_lock:
xor rax,rax
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock

; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
;jmp label_899

mov rax,rbx
add rax,8
xchg [spin_lock_core],rax

But before the code reaches xchg [spin_lock_core],rax, the first core jumps out of the program (jmp label_899). That should cause the other threads to freeze, because they would be waiting for the [spin_lock_core] variable to be updated, which never happens. Instead, all four cores write to the output array extra_test_array, which is displayed on the terminal when the program exits. In other words, the lock fails to stop the cores until the core number is updated.
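
For reference, the documented behavior of cmpxchg explains what is observed here: lock cmpxchg [spin_lock_core],rbx compares rax with [spin_lock_core], and on failure it clears ZF and loads the current value of [spin_lock_core] into rax. If rax is not re-initialized inside the loop, the second iteration therefore compares the variable against its own value, succeeds, and stores rbx, so every core falls through after at most two iterations. An annotated sketch of the loop as posted:

Spin_lock:                           ; rax is never reset inside the loop
lock cmpxchg [spin_lock_core],rbx    ; on failure: ZF=0 and rax = [spin_lock_core]
jnz Spin_lock                        ; second pass: rax == [spin_lock_core], so the
                                     ; compare succeeds, rbx is stored, lock is broken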

The full, minimal code is below (as minimal as NASM can be in this case). The code is written as a shared object, and the behavior is reproducible as long as it is passed an input array (as written, it doesn't matter whether the input array is int or float):

; Header Section
[BITS 64]

[default rel]

global Main_Entry_fn
extern pthread_create, pthread_join, pthread_exit, pthread_self,    sched_getcpu
global FreeMem_fn
extern malloc, realloc, free
extern sprintf

section .data align=16
X_ctr: dq 0
data_master_ptr: dq 0
initial_dynamic_length: dq 0
XMM_Stack: dq 0, 0, 0, 0, 0, 0, 0
ThreadID: dq 0
X_ptr: dq 0
X_length: dq 0
X: dq 0
collect_ptr: dq 0
collect_length: dq 0
collect_ctr: dq 0
even_squares_list_ptrs: dq 0, 0, 0, 0
even_squares_list_ctr: dq 0
even_squares_list_length: dq 0
Number_Of_Cores: dq 32
pthread_attr_t: dq 0
pthread_arg: dq 0
Join_Ret_Val: dq 0
tcounter: dq 0
sched_getcpu_array: times 4 dq 0
ThreadIDLocked: dq 0
spin_lock_core: dq 0
extra_test_array: dq 0

; __________

section .text

Init_Cores_fn:

; _____
; Create Threads

label_0:

mov rdi,ThreadID            ; ThreadCount
mov rsi,pthread_attr_t  ; Thread Attributes
mov rdx,Test_fn         ; Function Pointer
mov rcx,pthread_arg
call pthread_create wrt ..plt

mov rdi,[ThreadID]      ; id to wait on
mov rsi,Join_Ret_Val        ; return value
call pthread_join wrt ..plt

mov rax,[tcounter]
add rax,8
mov [tcounter],rax
mov rbx,[Number_Of_Cores]
cmp rax,rbx
jl label_0

; _____

jmp label_900 ; All threads return here, and exit

; ______________________________________

Test_fn:

; Get the core number
call sched_getcpu wrt ..plt
mov rbx,8 ; multiply by 8
mul rbx
push rax

pop rax
mov rbx,rax
push rax

Spin_lock:
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock

; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
jmp label_899

mov rax,rbx
add rax,8
xchg [spin_lock_core],rax

;__________

label_899:

pop rax

ret

; __________

label_900:

mov rdi,extra_test_array ;audit_array
mov rax,rdi

ret

;__________
;Free the memory

FreeMem_fn:

;The pointer is passed back in rcx (of course)

sub rsp,40
call free wrt ..plt
add rsp,40
ret

; __________
; Main Entry


Main_Entry_fn:
push rdi
push rbp
push rbx
push r15
xor r15,r15
push r14
xor r14,r14
push r13
xor r13,r13
push r12
xor r12,r12
push r11
xor r11,r11
push r10
xor r10,r10
push r9
xor r9,r9
push r8
xor r8,r8
movsd [XMM_Stack+0],xmm13
movsd [XMM_Stack+8],xmm12
movsd [XMM_Stack+16],xmm11
movsd [XMM_Stack+24],xmm15
movsd [XMM_Stack+32],xmm14
movsd [XMM_Stack+40],xmm10
mov [X_ptr],rdi
mov [data_master_ptr],rsi
; Now assign lengths
lea rdi,[data_master_ptr]
mov rbp,[rdi]
xor rcx,rcx
movsd xmm0,qword[rbp+rcx]
cvttsd2si rax,xmm0
mov [X_length],rax
add rcx,8

; __________
; Write variables to assigned registers

mov r15,0
lea rdi,[rel collect_ptr]
mov r14,qword[rdi]
mov r13,[collect_ctr]
mov r12,[collect_length]
lea rdi,[rel X_ptr]
mov r11,qword[rdi]
mov r10,[X_length]

; __________

call Init_Cores_fn

movsd xmm10,[XMM_Stack+0]
movsd xmm14,[XMM_Stack+8]
movsd xmm15,[XMM_Stack+16]
movsd xmm11,[XMM_Stack+24]
movsd xmm12,[XMM_Stack+32]
movsd xmm13,[XMM_Stack+40]
pop r8
pop r9
pop r10
pop r11
pop r12
pop r13
pop r14
pop r15
pop rbx
pop rbp
pop rdi
ret
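
Worth noting about the harness above: in Init_Cores_fn, each pthread_create is immediately followed by pthread_join on the thread just created, so the four threads run one after another and the spinlock is never actually contended by two live threads at the same time. A minimal create-all-then-join-all variant, as a sketch (the ThreadIDs array and the NULL attribute/argument handling are assumptions, not part of the original):

section .data
ThreadIDs: times 4 dq 0          ; hypothetical: one pthread_t per thread

section .text
; create all four threads first ...
xor r12,r12                      ; r12 = thread index (callee-saved)
create_loop:
lea rdi,[rel ThreadIDs]
lea rdi,[rdi+r12*8]              ; &ThreadIDs[r12]
xor rsi,rsi                      ; NULL thread attributes
lea rdx,[rel Test_fn]            ; thread function
xor rcx,rcx                      ; NULL argument
call pthread_create wrt ..plt
inc r12
cmp r12,4
jl create_loop

; ... then join them all
xor r12,r12
join_loop:
lea rdi,[rel ThreadIDs]
mov rdi,[rdi+r12*8]              ; ThreadIDs[r12]
xor rsi,rsi                      ; don't collect the return value
call pthread_join wrt ..plt
inc r12
cmp r12,4
jl join_loop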

The lock cmpxchg instruction should keep failing until the [spin_lock_core] variable is updated, but it doesn't do that.

Thanks for any help in understanding why lock cmpxchg doesn't prevent the cores after core zero from entering this region of code.

UPDATE: further research shows that xor rax,rax is needed at the top of the Spin_lock section. When I insert that line, it reads like this:

Spin_lock:
xor rax,rax
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock

With that change it freezes, as expected. But when I remove the line jmp label_899 it still freezes, and it shouldn't do that.
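
The continued freeze is consistent with the cmpxchg semantics sketched above: with rax zeroed, the compare succeeds only while [spin_lock_core] is 0, but the handoff at the end sets the variable to 8, 16 and so on, and every waiter keeps comparing it against 0. A sketch of the mismatch:

Spin_lock:
xor rax,rax                          ; every waiter compares against 0
lock cmpxchg [spin_lock_core],rbx    ; succeeds only while [spin_lock_core] == 0
jnz Spin_lock

; ... meanwhile the finishing core hands off with:
mov rax,rbx
add rax,8
xchg [spin_lock_core],rax            ; the variable becomes 8, 16, 24 -- never 0
                                     ; again, so every waiter spins forever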

EDIT 122219:

Based on the comments on this question yesterday, I revised the spinlock code to (1) eliminate the atomic operations in favor of faster mov and cmp instructions, (2) assign a unique memory location to each core, and (3) separate the memory locations by more than 256 bytes to keep them off the same cache line.

Each core's memory location is set to 1 when the previous core finishes; when each core finishes, it sets its own memory location back to 0.

The code successfully executes core 0 if I have all the other cores loop out before the spinlock. When I let all four cores run through the spinlock, the program hangs again.

I've verified that each separate memory location is set to 1 when the previous core finishes.

Here's the updated spinlock section:

section .data
spin_lock_core: times 140 dq 0
spin_lock_core_offsets: dq 0,264,528,792

section .text

; Calculate the offset to spin_lock_core
mov rbp,spin_lock_core
mov rdi,spin_lock_core_offsets
mov rax,[rdi+rbx]
add rbp,rax

; ________

Spin_lock:
pause
cmp byte[rbp],1
jnz Spin_lock

xor rax,rax
mov [rbp],rax ; Set current memory location to zero

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rdx
mov rcx,rax

; Loop out if this is the last core
mov rax,rbx
add rax,8
cmp rax,[Number_Of_Cores]
jge label_899

; Set next core to 1 by adding 264 to the base address
add rbp,264
mov rax,1
mov [rbp],rax

Why does this code still hang?
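
One detail worth checking in the listing as posted: spin_lock_core is initialized to all zeros, and the spin loop waits for byte [rbp] to become 1, but nothing shown ever sets the first core's slot to 1, so the first core into the lock has nothing to release it. Unless that slot is initialized elsewhere, a one-time setup along these lines would be needed before the threads start (a sketch, assuming slot 0 belongs to core 0):

; before creating any threads, mark core 0's slot as ready
mov rbp,spin_lock_core
mov byte [rbp],1              ; core 0 may enter; it sets this back to 0 itself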

Recommended answer

I don't think you should use cmpxchg for this at all. Try this:

Spin_lock:
pause
cmp [spin_lock_core],rbx   ; wait until it is this core's turn
jnz Spin_lock

; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
;jmp label_899

lea rax,[rbx+8]
mov [spin_lock_core],rax   ; hand the turn to the next core
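
This works because there is only ever one writer at a time: a core stores to [spin_lock_core] only after observing its own turn. On x86-64, aligned 8-byte loads and stores are atomic, and stores are not reordered with other stores, so the plain mov that publishes the next core's number is sufficient here and no lock prefix is needed; the pause hint just reduces the cost of the spin loop while waiting. (That reasoning is specific to x86, not a portable guarantee.)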
