Is mfence for rdtsc necessary on x86_64 platform?
Question
unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
    "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);
Is the mfence in the above code necessary?
Based on my tests, I could not observe any CPU reordering.
A fragment of the test code is included below.
inline uint64_t clock_cycles() {
    unsigned int lo = 0;
    unsigned int hi = 0;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

unsigned t1 = clock_cycles();
unsigned t2 = clock_cycles();
assert(t2 > t1);
Recommended answer
What you need to perform a sensible measurement with rdtsc is a serializing instruction.
As is well known, a lot of people use cpuid before rdtsc. However, rdtsc needs to be serialized from above and from below (read: all instructions before it must be retired, and it must itself be retired before the test code starts).
Unfortunately, the second condition is often neglected because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives, people think that instructions that have "fence" in their names will do, but this is also untrue. Straight from Intel:
MFENCE does not serialize the instruction stream.
An instruction that is almost serializing, and that will do in any measurement where previous stores don't need to complete, is lfence.
Simply put, lfence makes sure that no new instruction starts before all prior instructions have completed locally. See this answer of mine for a more detailed explanation on locality.
It also doesn't drain the store buffer like mfence does, and it doesn't clobber the registers like cpuid does.
So lfence / rdtsc / lfence is a better crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).
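The lfence / rdtsc / lfence sequence described above could be written with GCC-style inline assembly roughly as follows. This is only a minimal sketch: the name clock_cycles_fenced is hypothetical (it does not appear in the question), and it assumes an x86_64 CPU and a GCC- or Clang-compatible compiler.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original question): read the TSC with
 * an lfence on each side.
 * - The first lfence waits until all earlier instructions have completed
 *   locally before rdtsc executes.
 * - The second lfence keeps later instructions from starting until rdtsc
 *   has completed. */
static inline uint64_t clock_cycles_fenced(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)   /* rdtsc writes edx:eax */
        :
        : "memory"             /* also stop the compiler from moving memory accesses across it */
    );
    return ((uint64_t)hi << 32) | lo;
}
```

Note that, unlike mfence, this does not wait for earlier stores to become globally visible, and the "memory" clobber only constrains the compiler, not the CPU.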
If your test to detect reordering is assert(t2 > t1), then I believe you will test nothing.
Leaving aside the return and the call, which may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc instructions even if one is right after the other.
Imagine we have an rdtsc2 that is exactly like rdtsc but writes ecx:ebx [1].
When executing

rdtsc
rdtsc2

it is highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for another instruction to execute if the current one cannot be executed yet.
But rdtsc has no dependency on any previous instruction, so it's unlikely to be delayed when encountered by the OoO core.
However, peculiar internal micro-architectural details may invalidate my thesis, hence the word "likely" in my previous statement.
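As a rough illustration of why such a test says little, the sketch below counts how often the second of two back-to-back rdtsc reads comes out smaller than the first. The names read_tsc and count_backward_pairs are hypothetical. Unlike the question's snippet, it keeps both timestamps as full 64-bit values, because storing them in an unsigned truncates the counter and can make the comparison fail for reasons unrelated to reordering.

```c
#include <stdint.h>

/* Plain, unfenced TSC read, equivalent to the question's clock_cycles(). */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: run `iterations` pairs of back-to-back reads and
 * count the pairs in which the second value is smaller than the first. */
static unsigned long count_backward_pairs(unsigned long iterations) {
    unsigned long backward = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        uint64_t t1 = read_tsc();
        uint64_t t2 = read_tsc();
        if (t2 < t1) {
            backward++;
        }
    }
    return backward;
}
```

By the argument above, a count of zero from something like count_backward_pairs(1000000) would be the likely outcome. It would not show that rdtsc is ordered with respect to the surrounding code being measured, which is what the fences are for.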
[1] We don't actually need this altered instruction: register renaming will do the job, but in case you are not familiar with it, this will help.