What is wrong with this emulation of the CMPXCHG16B instruction?

Question

I'm trying to run a binary program that uses the CMPXCHG16B instruction in one place; unfortunately, my Athlon 64 X2 3800+ doesn't support it. Which is great, because I see it as a programming challenge. The instruction doesn't seem that hard to implement with a jump to a code cave, so that's what I did, but something didn't work: the program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?

Firstly, the actual piece of machine code that I'm trying to emulate is this:

f0 49 0f c7 08                lock cmpxchg16b OWORD PTR [r8]

Excerpt from the Intel manual describing CMPXCHG16B:

Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128. Else, clear ZF and load m128 into RDX:RAX.
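
As a reading aid, the manual's wording can be transcribed into C roughly as follows. This is only a sketch of the non-atomic semantics: the `regs_t` struct and the `cmpxchg16b_model` name are made up for illustration, and nothing here provides the atomicity of the real locked instruction. Note that the comparison is an equality test on both qwords, which in assembly is what `CMP` gives you, whereas `TEST` computes a bitwise AND.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register file holding only what CMPXCHG16B touches. */
typedef struct {
    uint64_t rax, rbx, rcx, rdx;
    bool zf;
} regs_t;

/* Non-atomic model of CMPXCHG16B.
   mem[0] is the low qword of m128, mem[1] the high qword. */
static void cmpxchg16b_model(uint64_t mem[2], regs_t *r)
{
    /* The real instruction faults unless the operand is 16-byte aligned. */
    assert(((uintptr_t)mem & 15u) == 0);

    if (mem[0] == r->rax && mem[1] == r->rdx) {
        /* Equal: store RCX:RBX into memory and set ZF. */
        mem[0] = r->rbx;
        mem[1] = r->rcx;
        r->zf = true;
    } else {
        /* Not equal: load memory into RDX:RAX and clear ZF. */
        r->rax = mem[0];
        r->rdx = mem[1];
        r->zf = false;
    }
}
```

Callers of the real instruction typically branch on ZF afterwards, so an emulation has to leave the flag in the right state as well as the registers and memory.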

First I replace all 5 bytes of the instruction with a jump to a code cave containing my emulation procedure; luckily, the jump takes up exactly 5 bytes! The jump is actually a call instruction (e8), but it could be a jmp (e9); both work.

e8 96 fb ff ff            call 0xfffffb96 (-1130)

This is a relative jump with a 32-bit signed offset encoded in two's complement; the offset points to the code cave relative to the address of the next instruction.
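
To make the offset arithmetic concrete, here is a minimal sketch of how such a 5-byte patch could be encoded and decoded. The function names and any addresses passed in are hypothetical; only the E8 opcode and the "relative to the next instruction" rule come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 5-byte near CALL (opcode E8) into buf such that, when the
   instruction sits at address `site`, it transfers control to `target`. */
static void encode_call_rel32(uint8_t buf[5], uint64_t site, uint64_t target)
{
    /* The displacement is relative to the address of the NEXT instruction,
       i.e. the patch site plus the 5 bytes of the CALL itself. */
    int64_t disp = (int64_t)(target - (site + 5));
    assert(disp >= INT32_MIN && disp <= INT32_MAX); /* must fit in rel32 */

    uint32_t rel = (uint32_t)(int32_t)disp;
    buf[0] = 0xE8;
    /* x86 stores the immediate in little-endian byte order. */
    buf[1] = (uint8_t)(rel);
    buf[2] = (uint8_t)(rel >> 8);
    buf[3] = (uint8_t)(rel >> 16);
    buf[4] = (uint8_t)(rel >> 24);
}

/* Recover the signed displacement from an encoded E8/E9 instruction. */
static int32_t decode_rel32(const uint8_t insn[5])
{
    uint32_t rel = (uint32_t)insn[1] | (uint32_t)insn[2] << 8 |
                   (uint32_t)insn[3] << 16 | (uint32_t)insn[4] << 24;
    return (int32_t)rel;
}
```

For instance, with the patch at a hypothetical address 0x1000 and the cave at 0xb9b, the displacement is 0xb9b - 0x1005, a negative value because the cave lies before the patch site.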

Next, the emulation code I'm jumping to:

PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET

Personally, I'm happy with it, and I think it matches the functional specification given in the manual. It restores the stack and the two registers r10 and r11 to their original state and then resumes execution. Alas, it does not work! That is, the code runs, but the program acts as if it's waiting for a tip and burning electricity. This indicates that my emulation was not perfect and that I inadvertently broke its loop. Do you see anything wrong with it?

I notice that this is the atomic variant of it, owing to the lock prefix. I'm hoping it's something other than contention that I got wrong. Or is there a way to emulate atomicity too?

Recommended answer

It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.

You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from @Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.

A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.

What about MFENCE? Seems to be what I need.

MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.

Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.
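
The tearing that remains can be illustrated with a small C11 sketch. All names here are hypothetical, and the "other core" is simulated deterministically by a callback invoked between the two compare-exchanges, so this is a single-threaded demonstration of the interleaving rather than a real race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A 16-byte object handled as two independently atomic 8-byte halves. */
typedef struct {
    _Atomic uint64_t lo, hi;
} pair_t;

/* Hypothetical hook standing in for "another core runs here". */
typedef void (*interleave_fn)(pair_t *);

/* Attempt a 16-byte compare-exchange as two 8-byte ones. Each half is
   atomic on its own, but nothing ties the two halves together. */
static bool cas16_as_two_cas8(pair_t *p, uint64_t exp_lo, uint64_t exp_hi,
                              uint64_t new_lo, uint64_t new_hi,
                              interleave_fn between)
{
    if (!atomic_compare_exchange_strong(&p->lo, &exp_lo, new_lo))
        return false;
    if (between)
        between(p); /* window in which another writer may act */
    /* If this second CAS fails, the low half has already been published. */
    return atomic_compare_exchange_strong(&p->hi, &exp_hi, new_hi);
}

/* Simulated competing writer: changes only the high half. */
static void other_core_bumps_hi(pair_t *p)
{
    atomic_store(&p->hi, 99);
}

/* Single-threaded demonstration of the torn result. */
static void demo_tearing(void)
{
    pair_t p;
    atomic_init(&p.lo, 1);
    atomic_init(&p.hi, 2);

    bool ok = cas16_as_two_cas8(&p, 1, 2, 10, 20, other_core_bumps_hi);

    assert(!ok);                      /* the 16-byte operation "failed" ... */
    assert(atomic_load(&p.lo) == 10); /* ... yet the new low half is visible */
    assert(atomic_load(&p.hi) == 99); /* next to a foreign high half */
}
```

The object ends up in a state that neither writer intended, which a true 16-byte compare-exchange would never produce.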

Atomic 16B loads/stores solve the tearing problem but not the atomicity problem between loads and stores. They're possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way naturally-aligned 8B loads and stores are.

The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.
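
The global-lock idea can be sketched in C11 along these lines. This is not Xen's code: the spinlock, the function names, and the array-based calling convention are all assumptions made for illustration, and the scheme is only sound if every access to the 16-byte object, including plain loads and stores, goes through the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One process-wide spinlock guarding every emulated 16-byte operation. */
static atomic_flag g_emul_lock = ATOMIC_FLAG_INIT;

static void emul_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&g_emul_lock,
                                             memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

static void emul_unlock(void)
{
    atomic_flag_clear_explicit(&g_emul_lock, memory_order_release);
}

/* Emulated lock cmpxchg16b. `expected` plays the role of RDX:RAX and
   `desired` that of RCX:RBX; index 0 is the low qword.
   Returns the value ZF would have. */
static bool emulated_cmpxchg16b(uint64_t mem[2], uint64_t expected[2],
                                const uint64_t desired[2])
{
    /* Mirror the hardware's 16-byte alignment requirement. */
    assert(((uintptr_t)mem & 15u) == 0);

    bool zf;
    emul_lock();
    if (mem[0] == expected[0] && mem[1] == expected[1]) {
        mem[0] = desired[0];
        mem[1] = desired[1];
        zf = true;
    } else {
        expected[0] = mem[0];
        expected[1] = mem[1];
        zf = false;
    }
    emul_unlock();
    return zf;
}

/* Plain 16-byte loads must take the same lock, or they can see a tear. */
static void emulated_load16(const uint64_t mem[2], uint64_t out[2])
{
    emul_lock();
    out[0] = mem[0];
    out[1] = mem[1];
    emul_unlock();
}
```

Any code that touches the object without taking `g_emul_lock`, such as a native locked instruction elsewhere in an unmodified binary, bypasses the protection entirely.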

As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).

See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.

I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.

I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.
