Multicore in NASM Windows: threads execute randomly


Question

I have 64-bit NASM code on Windows that runs four simultaneous threads (each assigned to a separate core) on a four-core x86-64 machine.

The threads are created in a loop. After the threads are created, the code calls WaitForMultipleObjects to wait for them. Each thread runs Test_Function (see code below).

Each thread (core) executes Test_Function across a large array. The first core starts at data element zero, the second core starts at 1, the third core starts at 2, the fourth core starts at 3, and each core increments by four (e.g., 0, 4, 8, 12).

In Test_Function I created a small test program that writes one of the input data values to the location corresponding to its startbyte, to verify that I have successfully created four threads and they return the correct data.

Each thread should write the stride value (32), but the test shows that the four fields are filled in randomly, with some fields showing as zero. If I repeat the test multiple times, I see there is no consistency to which fields will have the value 32 (the others always show as 0). That could be a side effect of WaitForMultipleObjects, but I haven't seen anything in the docs to confirm that.

Also, WaitForMultipleObjects waits on the ThreadHandles returned by CreateThread. When I examine the ThreadHandles array, it always looks like this: 268444374, 32, 1652, 1584. Only the first element looks like the size of a handle; the others do not look like handle values.

One possibility is that the two parameters passed on the stack may not be in the correct locations:

mov rax,0
mov [rsp+40],rax            ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax            ; ThreadID

According to the docs, ThreadCount should be a pointer. When I change the line to mov rax,ThreadCount (the pointer value), the program crashes. When I change it to:

mov rax,0
mov [rsp+32],rax            ; use default creation flags
mov rax,ThreadCount
mov [rsp+40],rax            ; ThreadID

now it reliably processes the first thread, but not threads 2-4.

So the bottom line is the threads are being created but they execute randomly, with some threads not executing at all, in no particular order. When I change the CreateThread parameters (as shown above) the first thread executes, but not threads 2-4.

Here is the test code showing the relevant parts. If a reproducible example is needed, I can prepare one.

Thanks for any ideas.

Init_Cores_fn:
; EACH OF THE CORES CALLS Test_Function AND EXECUTES THE WHOLE PROGRAM.  
; WE PASS THE STARTING BYTE (0, 8, 16, 24) AND THE "STRIDE" = NUMBER OF CORES.  
; ON RETURN, WE SYNCHRONIZE ANY DATA.  ON ENTRY TO EACH CORE, SET THE REGISTERS

; Populate the ThreadInfo array with vars to pass
; ThreadInfo: length, startbyte, stride, vars into registers on entry to each core
mov rdi,ThreadInfo
mov rax,ThreadInfoLength
mov [rdi],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Register Vars
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10

mov rbp,rsp ; preserve caller's stack frame
sub rsp,56 ; Shadow space

; _____

label_0:

mov rdi,ThreadInfo
mov rax,[FirstByte]
mov [rdi+8],rax ; 0, 8, 16, or 24

; _____
; Create Threads

mov rcx,0               ; lpThreadAttributes (Security Attributes)
mov rdx,0               ; dwStackSize
mov r8,Test_Function        ; lpStartAddress (function pointer)
mov r9,ThreadInfo       ; lpParameter (array of data passed to each core)

mov rax,0
mov [rsp+40],rax            ; use default creation flags
mov rax,[ThreadCount]
mov [rsp+32],rax            ; ThreadID

call CreateThread

; Move the handle into ThreadHandles array (returned in rax)
mov rdi,ThreadHandles
mov rcx,[FirstByte]
mov [rdi+rcx],rax

mov rax,[FirstByte]
add rax,8
mov [FirstByte],rax

mov rax,[ThreadCount]
add rax,1
mov [ThreadCount],rax

mov rbx,4
cmp rax,rbx
jl label_0

; _____
; Wait

mov rcx,rax         ; number of handles
mov rdx,ThreadHandles       ; pointer to handles array
mov r8,1                ; wait for all threads to complete
mov r9,1000         ; milliseconds to wait

call WaitForMultipleObjects

; _____

;[ Code HERE to do cleanup if needed after the four threads finish ]

mov rsp,rbp
jmp label_900

; __________________
; The function for all threads to call

Test_Function:

; Populate registers
mov rdi,rcx
mov rax,[rdi]
mov r15,[rdi+24]
mov rax,[rdi+8] ; start byte
mov r13,[rdi+40]
mov r12,[rdi+48]
mov r10,[rdi+56]
xor r11,r11
xor r9,r9
pxor xmm15,xmm15
pxor xmm15,xmm14
pxor xmm15,xmm13

; Now test it - BUT the first thread does not write data
mov rcx,[rdi+8] ; start byte
mov rax,[rdi+16] ; stride
cvtsi2sd xmm0,rax
movsd [r15+rcx],xmm0
ret

Recommended answer

I solved this problem, and here is the solution. Raymond Chen alluded to this in the comments above before urging me to use a higher level language, but I didn't understand it until today. I am posting this answer so it's easily accessible and understood by anyone who has the same problem in assembly language (or any other language) in the future because Raymond's comment (which I just upvoted) is now buried in the other comments above.

The problem is the ThreadInfo array, passed here as the fourth parameter to CreateThread (in r9 under the Windows x64 calling convention). Each thread must have its own separate copy of ThreadInfo. In my application, the data in ThreadInfo are all the same except for the StartByte parameter (at rdi+8), so the creation loop reused a single array and overwrote StartByte on every pass. Because all four threads read the same memory, a thread could start after the loop had already changed StartByte and then write to the wrong slot. Instead, I now create a separate ThreadInfo array for each core (ThreadInfo1, 2, 3, and 4) and pass a pointer to the corresponding ThreadInfo array.

I implemented it in my application as a call to the following dup function but it could be implemented other ways as well:

DupThreadInfo:
mov rdi,ThreadInfo2
mov rax,8
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
; _____

mov rdi,ThreadInfo3
mov rax,0
mov [rdi],rax       ; length (number of vars into registers plus 3 elements)
mov rax,16
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10

mov rdi,ThreadInfo4
mov rax,0
mov [rdi],rax       ; length (number of vars into registers plus 3 elements)
mov rax,24
mov [rdi+8],rax
mov rax,[stride]
mov [rdi+16],rax    ; 8 x number of cores (32 in this example)
; Vars (these registers are populated on main entry)
mov [rdi+24],r15
mov [rdi+32],r14
mov [rdi+40],r13
mov [rdi+48],r12
mov [rdi+56],r10
ret
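
The same per-thread setup can also be written as a loop, so the routine does not grow with the number of cores. This is only a minimal sketch under assumptions: the names FillThreadInfo, ThreadInfoBlocks, NUM_THREADS and INFO_QWORDS are hypothetical and not from the original code, the blocks are laid out back to back (64 bytes each, same offsets as above), and r15, r14, r13, r12 and r10 are assumed to hold the values to pass, as in DupThreadInfo.

```nasm
default rel

NUM_THREADS equ 4                        ; assumed core count
INFO_QWORDS equ 8                        ; 8 qwords = 64 bytes, offsets 0..56

section .data
stride           dq 8*NUM_THREADS        ; 32 in this example

section .bss
ThreadInfoBlocks resq INFO_QWORDS*NUM_THREADS   ; one block per thread

section .text
global FillThreadInfo

; Fills one ThreadInfo block per thread.
; Only volatile registers (rax, rcx, rdx) are modified.
FillThreadInfo:
    lea  rdx, [ThreadInfoBlocks]         ; rdx = current block
    xor  ecx, ecx                        ; rcx = start byte: 0, 8, 16, 24
    mov  rax, [stride]
.next:
    mov  qword [rdx], 0                  ; length field, left at 0 as in DupThreadInfo
    mov  [rdx+8],  rcx                   ; start byte, unique per thread
    mov  [rdx+16], rax                   ; stride, same for every thread
    mov  [rdx+24], r15
    mov  [rdx+32], r14
    mov  [rdx+40], r13
    mov  [rdx+48], r12
    mov  [rdx+56], r10
    add  rdx, INFO_QWORDS*8              ; advance to the next block
    add  rcx, 8
    cmp  rcx, 8*NUM_THREADS
    jl   .next
    ret
```

With this layout, thread i would receive ThreadInfoBlocks + i*64 in r9 when CreateThread is called.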

Because all data in the ThreadInfo arrays are the same except the second element, a more efficient way to do this would be to pass a 2-element array where the first element is the StartByte and the second element is a pointer to the static ThreadInfo array. That's especially important when we are working with more than four cores because the DupThreadInfo section would be needlessly long. That solution would avoid a call, but I haven't implemented that yet.
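
The two-element idea could look roughly like the following. It is a sketch under stated assumptions, not the implementation used in the application: the names ThreadArgs, Results and NUM_THREADS are hypothetical, the output array is reduced to one double per thread, and the program is assumed to be assembled with nasm -f win64 and linked against kernel32 with main as the entry point. It also places the two stack arguments of CreateThread where the Windows x64 calling convention expects them: the fifth argument (dwCreationFlags) at [rsp+32] and the sixth (lpThreadId) at [rsp+40].

```nasm
default rel
extern CreateThread
extern WaitForMultipleObjects
extern ExitProcess
global main

NUM_THREADS equ 4                       ; assumed core count

section .data
stride        dq 8*NUM_THREADS          ; 32, as in the question

section .bss
Results       resq NUM_THREADS          ; hypothetical output: one double per thread
ThreadInfo    resq 8                    ; shared block, read-only once threads run
ThreadArgs    resq 2*NUM_THREADS        ; per thread: StartByte, pointer to ThreadInfo
ThreadHandles resq NUM_THREADS

section .text
main:
    push rbx                            ; makes rsp 16-byte aligned
    sub  rsp, 48                        ; 32 bytes shadow space + 2 stack arguments

    ; Fill the shared block once, before any thread exists.
    mov  rax, [stride]
    mov  [ThreadInfo+16], rax           ; stride
    lea  rax, [Results]
    mov  [ThreadInfo+24], rax           ; base of the output array

    xor  ebx, ebx                       ; rbx = thread index
.create:
    mov  rax, rbx
    shl  rax, 4                         ; 16 bytes per descriptor
    lea  r9, [ThreadArgs]
    add  r9, rax                        ; lpParameter = this thread's own descriptor
    mov  rax, rbx
    shl  rax, 3
    mov  [r9], rax                      ; StartByte = 0, 8, 16, 24
    lea  rax, [ThreadInfo]
    mov  [r9+8], rax                    ; pointer to the shared block
    xor  ecx, ecx                       ; lpThreadAttributes = NULL
    xor  edx, edx                       ; dwStackSize = 0 (default)
    lea  r8, [Test_Function]            ; lpStartAddress
    mov  qword [rsp+32], 0              ; dwCreationFlags = 0 (run immediately)
    mov  qword [rsp+40], 0              ; lpThreadId = NULL (ID not needed)
    call CreateThread
    test rax, rax
    jz   .fail                          ; NULL handle means creation failed
    lea  rcx, [ThreadHandles]
    mov  [rcx+rbx*8], rax
    inc  rbx
    cmp  rbx, NUM_THREADS
    jl   .create

    mov  ecx, NUM_THREADS               ; nCount
    lea  rdx, [ThreadHandles]           ; lpHandles
    mov  r8d, 1                         ; bWaitAll = TRUE
    mov  r9d, 0xFFFFFFFF                ; dwMilliseconds = INFINITE
    call WaitForMultipleObjects

    xor  ecx, ecx
    call ExitProcess                    ; does not return
.fail:
    mov  ecx, 1
    call ExitProcess

; Thread entry point: rcx = pointer to this thread's two-element descriptor.
Test_Function:
    mov  rdx, [rcx]                     ; StartByte
    mov  r8,  [rcx+8]                   ; shared ThreadInfo
    mov  rax, [r8+16]                   ; stride
    mov  r9,  [r8+24]                   ; output array base
    cvtsi2sd xmm0, rax
    movsd [r9+rdx], xmm0                ; each thread writes only its own slot
    xor  eax, eax                       ; thread exit code 0
    ret
```

Each descriptor is written before its thread is created and never touched again, so no thread can observe another thread's StartByte. The shared block is safe to share only because nothing modifies it after the first CreateThread call.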
