How can I accurately benchmark unaligned access speed on x86_64?


Question


In an answer, I've stated that unaligned access has almost the same speed as aligned access for a long time now (on x86/x86_64). I didn't have any numbers to back up this statement, so I've created a benchmark for it.

Do you see any flaws in this benchmark? Can you improve on it (I mean, to increase GB/sec, so it reflects the truth better)?

#include <sys/time.h>
#include <stdio.h>

template <int N>
__attribute__((noinline))
void loop32(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("mov     (%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x04(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x08(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x0c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x10(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x14(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x18(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x1c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x20(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x24(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x28(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x2c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x30(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x34(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x38(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x3c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x40(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x44(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x48(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x4c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x50(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x54(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x58(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x5c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x60(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x64(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x68(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x6c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x70(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x74(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x78(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x7c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x80(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x84(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x88(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x8c(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x90(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x94(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x98(%0), %%eax" : : "r"(v) :"eax");
        __asm__ ("mov 0x9c(%0), %%eax" : : "r"(v) :"eax");
        v += 160;
    }
}

template <int N>
__attribute__((noinline))
void loop64(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("mov     (%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x08(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x10(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x18(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x20(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x28(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x30(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x38(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x40(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x48(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x50(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x58(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x60(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x68(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x70(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x78(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x80(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x88(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x90(%0), %%rax" : : "r"(v) :"rax");
        __asm__ ("mov 0x98(%0), %%rax" : : "r"(v) :"rax");
        v += 160;
    }
}

template <int N>
__attribute__((noinline))
void loop128a(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("movaps     (%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x10(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x20(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x30(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x40(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x50(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x60(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x70(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x80(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movaps 0x90(%0), %%xmm0" : : "r"(v) :"xmm0");
        v += 160;
    }
}

template <int N>
__attribute__((noinline))
void loop128u(const char *v) {
    for (int i=0; i<N; i+=160) {
        __asm__ ("movups     (%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x10(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x20(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x30(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x40(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x50(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x60(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x70(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x80(%0), %%xmm0" : : "r"(v) :"xmm0");
        __asm__ ("movups 0x90(%0), %%xmm0" : : "r"(v) :"xmm0");
        v += 160;
    }
}

long long int t() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (long long int)tv.tv_sec*1000000 + tv.tv_usec;
}

int main() {
    const int ITER = 10;
    const int N = 1600000000;

    char *data = reinterpret_cast<char *>(((reinterpret_cast<unsigned long long>(new char[N+32])+15)&~15));
    for (int i=0; i<N+16; i++) data[i] = 0;

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop32<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop32<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop32<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop32<N>(data+1);
        }
        long long int t4 = t();

        printf(" 32-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf(" 32-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop64<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop64<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop64<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop64<N>(data+1);
        }
        long long int t4 = t();

        printf(" 64-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf(" 64-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }

    {
        long long int t0 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop128a<N/100000>(data);
        }
        long long int t1 = t();
        for (int i=0; i<ITER*100000; i++) {
            loop128u<N/100000>(data+1);
        }
        long long int t2 = t();
        for (int i=0; i<ITER; i++) {
            loop128a<N>(data);
        }
        long long int t3 = t();
        for (int i=0; i<ITER; i++) {
            loop128u<N>(data+1);
        }
        long long int t4 = t();

        printf("128-bit, cache: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t1-t0)/1000, (double)N*ITER/(t2-t1)/1000, 100.0*(t2-t1)/(t1-t0)-100.0f);
        printf("128-bit,   mem: aligned: %8.4f GB/sec unaligned: %8.4f GB/sec, difference: %0.3f%%\n", (double)N*ITER/(t3-t2)/1000, (double)N*ITER/(t4-t3)/1000, 100.0*(t4-t3)/(t3-t2)-100.0f);
    }
}

Solution

Timing method. I probably would have set it up so the test was selected by a command-line argument, so I could time it with perf stat ./unaligned-test, and get perf counter results instead of just wall-clock times for each test. That way, I wouldn't have to care about turbo / power-saving, since I could measure in core clock cycles. (Not the same thing as gettimeofday / rdtsc reference cycles unless you disable turbo and other frequency-variation.)
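Something along these lines would do it (a rough, untested sketch, not the original benchmark; the test name, constants, and the simple 64-bit read loop are placeholders I'm making up just to show the structure):

#include <cstdint>
#include <cstring>

// One test per process invocation, selected by argv, so that e.g.
//   perf stat -e task-clock,cycles,instructions -r4 ./unaligned-test unaligned64
// measures exactly one loop, in core clock cycles.
__attribute__((noinline)) void loop64(const char *v, long bytes) {
    for (long i = 0; i < bytes; i += 8)
        __asm__ volatile("mov (%0), %%rax" : : "r"(v + i) : "rax");
}

int main(int argc, char **argv) {
    const long N = 1L << 26;                                    // 64 MiB working set
    char *raw = new char[N + 64];
    char *data = reinterpret_cast<char *>(
        (reinterpret_cast<uintptr_t>(raw) + 63) & ~uintptr_t(63));  // 64B-aligned base
    std::memset(data, 0, N);

    long offset = 0;                                            // "aligned64" by default
    if (argc > 1 && !std::strcmp(argv[1], "unaligned64")) offset = 1;
    // ... add one name per test variant here

    for (int rep = 0; rep < 100; rep++)
        loop64(data + offset, N - 64);
    return 0;
}

Then cycles divided by the total load count gives cycles per load directly, independent of turbo.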


You're only testing throughput, not latency, because none of the loads are dependent.
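To measure load latency, each load's address has to depend on the previous load's result. A minimal sketch of that idea (mine, not part of the question's benchmark): store a pointer to itself and chase it, so total core cycles ≈ iteration count × load-use latency:

#include <cstring>

int main() {
    alignas(64) static char buf[128];
    void *self = buf + 1;                          // misaligned, but within one 64B line (no split)
    std::memcpy(buf + 1, &self, sizeof self);      // the slot points to itself

    void *cur = self;
    for (long i = 0; i < 100000000; i++)
        __asm__ volatile("mov (%0), %0" : "+r"(cur) : : "memory");  // mov rax,[rax]-style dependent load
    return 0;
}

Varying the +1 offset (e.g. 60 for a cache-line split, or a slot 4 bytes before a page boundary) then shows the latency penalties for the split cases directly.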

Your cache numbers will be worse than your memory numbers, but you might not realize why: your cache numbers may suffer from bottlenecking on the number of split-load registers that handle loads/stores that cross a cache-line boundary. For sequential read, the outer levels of cache are still always just going to see a sequence of requests for whole cache lines. It's only the execution units getting data from L1D that have to care about alignment. To test misalignment for the non-cached case, you could do scattered loads, so cache-line splits would need to bring two cache lines into L1.

Cache lines are 64 bytes wide1, so you're always testing a mix of cache-line splits and within-a-cache-line accesses. Testing always-split loads would bottleneck harder on the split-load microarchitectural resources. (Actually, depending on your CPU, the cache-fetch width might be narrower than the line size. Recent Intel CPUs can fetch any unaligned chunk from inside a cache line, but that's because they have special hardware to make that fast. Other CPUs may only be at their fastest when fetching within a naturally-aligned 16 byte chunk or something. @BeeOnRope says that AMD CPUs may care about 16 byte and 32 byte boundaries.)
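For example (again a sketch of mine, not the answer's code): with 8-byte loads at offset 60 within every 64-byte line, every single load is a cache-line split, instead of the 1-in-8 split rate that a +1 offset gives sequential 8-byte loads:

#include <cstdint>

// Every 8-byte load covers bytes 60..67 of a line, i.e. always straddles
// a 64-byte boundary (assuming 64B lines).
__attribute__((noinline)) void always_split(const char *base, long lines) {
    for (long i = 0; i < lines; i++)
        __asm__ volatile("mov (%0), %%rax" : : "r"(base + i * 64 + 60) : "rax");
}

int main() {
    const long LINES = 1 << 16;                       // 4 MiB buffer
    char *raw = new char[LINES * 64 + 128];           // +128 slack covers the final split
    char *base = reinterpret_cast<char *>(
        (reinterpret_cast<uintptr_t>(raw) + 63) & ~uintptr_t(63));
    for (int rep = 0; rep < 1000; rep++)
        always_split(base, LINES);
    return 0;
}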

You're not testing store → load forwarding at all. For existing tests, and a nice way to visualize results for different alignments, see this stuffedcow.net blog post: Store-to-Load Forwarding and Memory Disambiguation in x86 Processors.

Passing data through memory is an important use case, and misalignment + cache-line splits can interfere with store-forwarding on some CPUs. To properly test this, make sure you test different misalignments, not just 1:15 (vector) or 1:3 (integer). (You currently only test a +1 offset relative to 16B-alignment).
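A sketch of the kind of loop that exercises this (my construction, not from the answer): store 8 bytes at a chosen offset, immediately reload them, and chain the reload into the next store so the measured cycles per iteration are dominated by store-forwarding latency at that misalignment:

#include <cstdint>
#include <cstdlib>

int main(int argc, char **argv) {
    alignas(64) static char buf[256];
    long off = argc > 1 ? std::atol(argv[1]) : 0;   // try 0..63 to cover every position in a line
    char *p = buf + 64 + off;

    uint64_t x = 1;
    for (long i = 0; i < 100000000; i++)
        __asm__ volatile(
            "mov %0, (%1)\n\t"                      // 8-byte store at buf+64+off
            "mov (%1), %0"                          // dependent 8-byte reload (store-to-load forwarding)
            : "+r"(x)
            : "r"(p)
            : "memory");
    return static_cast<int>(x);                     // keep the dependency chain live
}

A real test would also vary the store width vs. the reload width and their relative offsets, like the tables in Intel's optimization manual do.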

I forget if it's just for store-forwarding, or for regular loads, but there may be less penalty when a load is split evenly across a cache-line boundary (an 8:8 vector, and maybe also 4:4 or 2:2 integer splits). You should test this. (I might be thinking of P4 lddqu or Core 2 movdqu.)

Intel's optimization manual has big tables of misalignment vs. store-forwarding from a wide store to narrow reloads that are fully contained in it. On some CPUs, this works in more cases when the wide store was naturally-aligned, even if it doesn't cross any cache-line boundaries. (Maybe on SnB/IvB, since they use a banked L1 cache with 16B banks, and splits across those can affect store forwarding.

I didn't re-check the manual, but if you really want to test this experimentally, that's something you should be looking for.)


Which reminds me, misaligned loads are more likely to provoke cache-bank conflicts on SnB/IvB (because one load can touch two banks). But you won't see this loading from a single stream, because accessing the same bank in the same line twice in one cycle is fine. It's only accessing the same bank in different lines that can't happen in the same cycle. (e.g., when two memory accesses are a multiple of 128 bytes apart.)
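If you wanted to provoke that deliberately (a speculative sketch on my part; the multiple-of-128-bytes condition is taken from the paragraph above and would need verifying on real SnB/IvB hardware), interleave two independent load streams a configurable distance apart and compare a multiple-of-128 distance against one offset by an extra 8 bytes:

#include <cstdint>
#include <cstdlib>

__attribute__((noinline)) void two_streams(const char *a, const char *b, long bytes) {
    for (long i = 0; i < bytes; i += 8) {
        __asm__ volatile("mov (%0), %%rax" : : "r"(a + i) : "rax");
        __asm__ volatile("mov (%0), %%rdx" : : "r"(b + i) : "rdx");   // should pair with the load above
    }
}

int main(int argc, char **argv) {
    const long N = 1 << 14;                              // 16 KiB per stream, fits in L1d
    long dist = argc > 1 ? std::atol(argv[1]) : 128 * 8; // try multiples of 128, then e.g. 128*8 + 8
    char *raw = new char[N + dist + 128];
    char *a = reinterpret_cast<char *>(
        (reinterpret_cast<uintptr_t>(raw) + 63) & ~uintptr_t(63));
    for (int rep = 0; rep < 100000; rep++)
        two_streams(a, a + dist, N);
    return 0;
}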

You don't make any attempt to test 4k page-splits. They are slower than regular cache-line splits, because they also need two TLB checks. (Skylake improved them from a ~100 cycle penalty to a ~5 cycle penalty beyond the normal load-use latency, though.)
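A sketch of a page-split load test (mine, not from the answer): put every 8-byte load 4 bytes before a 4 KiB boundary, so each one is both a line split and a page split and needs two TLB lookups:

#include <cstdint>

__attribute__((noinline)) void page_split_loads(const char *base, long pages) {
    for (long i = 1; i <= pages; i++)
        __asm__ volatile("mov (%0), %%rax"
                         : : "r"(base + i * 4096 - 4) : "rax");  // last 4 bytes of page i-1 + first 4 of page i
}

int main() {
    const long PAGES = 8;                               // small, so both pages of every split stay hot in the dTLB and L1d
    char *raw = new char[(PAGES + 2) * 4096];
    char *base = reinterpret_cast<char *>(
        (reinterpret_cast<uintptr_t>(raw) + 4095) & ~uintptr_t(4095));   // page-aligned
    for (long i = 0; i <= PAGES * 4096; i += 4096)
        base[i] = 0;                                    // fault all pages in before timing
    for (int rep = 0; rep < 2000000; rep++)
        page_split_loads(base, PAGES);
    return 0;
}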

You fail to test movups on aligned addresses, so you wouldn't detect that movups is slower than movaps on Core 2 and earlier even when the memory is aligned at runtime. (I think unaligned mov loads up to 8 bytes were fine even in Core 2, as long as they didn't cross a cache-line boundary. IDK how old a CPU you'd have to look at to find a problem with non-vector loads within a cache line. It would be a 32-bit only CPU, but you could still test 8 byte loads with MMX or SSE, or even x87. P5 Pentium and later guarantee that aligned 8 byte loads/stores are atomic, but P6 and newer guarantee that cached 8 byte loads/stores are atomic as long as no cache-line boundary is crossed. Unlike AMD, where 8 byte boundaries matter for atomicity guarantees even in cacheable memory. Why is integer assignment on a naturally aligned variable atomic on x86?)
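The fix is just to time the unaligned-instruction loop on the aligned pointer as well. Sketched standalone here (mirroring the question's loop128a / loop128u, but not its actual code); in the real benchmark each call would of course sit in its own timed region:

#include <cstdint>

__attribute__((noinline)) void loop128a(const char *v, long bytes) {
    for (long i = 0; i < bytes; i += 16)
        __asm__ volatile("movaps (%0), %%xmm0" : : "r"(v + i) : "xmm0");
}

__attribute__((noinline)) void loop128u(const char *v, long bytes) {
    for (long i = 0; i < bytes; i += 16)
        __asm__ volatile("movups (%0), %%xmm0" : : "r"(v + i) : "xmm0");
}

int main() {
    const long N = 1 << 15;                       // 32 KiB, stays in L1d
    alignas(16) static char data[(1 << 15) + 16];
    for (int rep = 0; rep < 1000000; rep++) {
        loop128a(data, N);                        // aligned instruction, aligned data (baseline)
        loop128u(data, N);                        // unaligned instruction, aligned data: the missing case
        loop128u(data + 1, N);                    // unaligned instruction, unaligned data
    }
    return 0;
}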

Go look at Agner Fog's stuff to learn more about how unaligned loads can be slower, and cook up tests to exercise those cases. Actually, Agner may not be the best resource for that, since his microarchitecture guide mostly focuses on getting uops through the pipeline. Just a brief mention of the cost of cache-line splits, nothing in-depth about throughput vs. latency.

See also: Cacheline splits, take two, from Dark Shikari's blog (x264 lead developer), talking about unaligned load strategies on Core2: it was worth it to check for alignment and use a different strategy for the block.


Footnotes:

  1. 64B cache lines is a safe assumption these days. Pentium 3 and earlier had 32B lines. P4 had 64B lines but they were often transferred in 128B-aligned pairs. I thought I remembered reading that P4 actually had 128B lines in L2 or L3, but maybe that was just a distortion of 64B lines transferred in pairs. 7-CPU definitely says 64B lines in both levels of cache for a P4 130nm.


See also uarch-bench results for Skylake. Apparently someone has already written a tester that checks every possible misalignment relative to a cache-line boundary.


## My testing on Skylake desktop (i7-6700k):

Addressing mode affects load-use latency, exactly as Intel documents in their optimization manual. I tested with integer mov rax, [rax+...], and with movzx/sx (in that case using the loaded value as an index, since it's too narrow to be a pointer).

;;;  Linux x86-64 NASM/YASM source.  Assemble into a static binary
;; public domain, originally written by peter@cordes.ca.
;; Share and enjoy.  If it breaks, you get to keep both pieces.

;;; This kind of grew while I was testing and thinking of things to test
;;; I left in some of the comments, but took out most of them and summarized the results outside this code block
;;; When I thought of something new to test, I'd edit, save, and up-arrow my assemble-and-run shell command
;;; Then edit the result into a comment in the source.

section .bss

ALIGN   2 * 1<<20   ; 2MB = 4096*512.  Uses hugepages in .bss but not in .data.  I checked in /proc/<pid>/smaps
buf:    resb 16 * 1<<20

section .text
global _start
_start:
    mov     esi, 128

;   mov             edx, 64*123 + 8
;   mov             edx, 64*123 + 0
;   mov             edx, 64*64 + 0
    xor             edx,edx
   ;; RAX points into buf, 16B into the last 4k page of a 2M hugepage

    mov             eax, buf + (2<<20)*0 + 4096*511 + 64*0 + 16
    mov             ecx, 25000000

%define ADDR(x)  x                     ; SKL: 4c
;%define ADDR(x)  x + rdx              ; SKL: 5c
;%define ADDR(x)  128+60 + x + rdx*2   ; SKL: 11c cache-line split
;%define ADDR(x)  x-8                 ; SKL: 5c
;%define ADDR(x)  x-7                 ; SKL: 12c for 4k-split (even if it's in the middle of a hugepage)
; ... many more things and a block of other result-recording comments taken out

%define dst rax



        mov             [ADDR(rax)], dst
align 32
.loop:
        mov             dst, [ADDR(rax)]
        mov             dst, [ADDR(rax)]
        mov             dst, [ADDR(rax)]
        mov             dst, [ADDR(rax)]
    dec         ecx
    jnz .loop

        xor edi,edi
        mov eax,231
    syscall

Then run with

asm-link load-use-latency.asm && disas load-use-latency && 
    perf stat -etask-clock,cycles,L1-dcache-loads,instructions,branches -r4 ./load-use-latency

+ yasm -felf64 -Worphan-labels -gdwarf2 load-use-latency.asm
+ ld -o load-use-latency load-use-latency.o
 (disassembly output so my terminal history has the asm with the perf results)

 Performance counter stats for './load-use-latency' (4 runs):

     91.422838      task-clock:u (msec)       #    0.990 CPUs utilized            ( +-  0.09% )
   400,105,802      cycles:u                  #    4.376 GHz                      ( +-  0.00% )
   100,000,013      L1-dcache-loads:u         # 1093.819 M/sec                    ( +-  0.00% )
   150,000,039      instructions:u            #    0.37  insn per cycle           ( +-  0.00% )
    25,000,031      branches:u                #  273.455 M/sec                    ( +-  0.00% )

   0.092365514 seconds time elapsed                                          ( +-  0.52% )

In this case, I was testing mov rax, [rax], naturally-aligned, so cycles = 4*L1-dcache-loads. 4c latency. I didn't disable turbo or anything like that. Since nothing is going off-core, core clock cycles are the best way to measure.

  • [base + 0..2047]: 4c load-use latency, 11c cache-line split, 11c 4k-page split (even when inside the same hugepage). See Is there a penalty when base+offset is in a different page than the base? for more details: if base+disp turns out to be in a different page than base, the load uop has to be replayed.
  • any other addressing mode: 5c latency, 11c cache-line split, 12c 4k-split (even inside a hugepage). This includes [rax - 16]. It's not disp8 vs. disp32 that makes the difference.

So: hugepages don't help avoid page-split penalties (at least not when both pages are hot in the TLB). A cache-line split makes addressing mode irrelevant, but "fast" addressing modes have 1c lower latency for normal and page-split loads.

4k-split handling is fantastically better than before; see @harold's numbers, where Haswell has ~32c latency for a 4k-split. (And older CPUs may be even worse than that. I thought pre-SKL it was supposed to be a ~100 cycle penalty.)

Throughput (regardless of addressing mode), measured by using a destination other than rax so the loads are independent:

  • no split: 0.5c.
  • CL-split: 1c.
  • 4k-split: ~3.8 to 3.9c (much better than pre-Skylake CPUs)

Same throughput/latency for movzx/movsx (including WORD splits), as expected because they're handled in the load port (unlike some AMD CPUs, where there's also an ALU uop).

Cache-line split loads get replayed from the RS (Reservation Station): the uops_dispatched_port.port_2 + port_3 counters showed 2x the number of mov rdi, [rdi] loads, in another test using basically the same loop. (This was a dependent-load case, not throughput limited.) You can't detect a split load until after AGU.

Presumably when a load uop finds out that it needs data from a 2nd line, it looks for a split register (the buffer that Intel CPUs use to handle split loads), and puts the needed part of the data from the first line into that split reg. And also signals back to the RS that it needs to be replayed. (This is guesswork.)

I think even if neither cache line is present on a split, the split-load replay should happen within a few cycles (perhaps as soon as the load port reports back to the RS that it was a split, i.e. after address-generation). So demand-load requests for both sides of the split can be in flight at once.


See also Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up? for more about uop replays. (But note that's for uops dependent on a load, not the load uop itself. In that Q&A, the dependent uops are also mostly loads.)

A cache-miss load doesn't itself need to be replayed to "accept" the incoming data when it's ready, only dependent uops do. See chat discussion on Are load ops deallocated from the RS when they dispatch, complete or some other time?. This https://godbolt.org/z/HJF3BN NASM test case on i7-6700k shows the same number of load uops dispatched regardless of L1d hits or L3 hits. But the number of ALU uops dispatched (not counting loop overhead) goes from 1 per load to ~8.75 per load. The scheduler aggressively schedules uops consuming the data to dispatch in the cycle when the load data might arrive from L2 cache (and then very aggressively after that, it seems), instead of waiting one extra cycle to see if it did or not.

We haven't tested how aggressive replay is when there's other independent but younger work that could be done on the same port whose inputs are definitely ready.


SKL has two hardware page-walk units, which is probably related to the massive improvement in 4k-split performance. Even when there are no TLB misses, presumably older CPUs had to account for the fact that there might be.

It's interesting that the 4k-split throughput is non-integer. I think my measurements had enough precision and repeatability to say this. Remember this is with every load being a 4k-split, and no other work going on (except for being inside a small dec/jnz loop). If you ever have this in real code, you're doing something really wrong.

I don't have any solid guesses at why it might be non-integer, but clearly there's a lot that has to happen microarchitecturally for a 4k-split. It's still a cache-line split, and it has to check the TLB twice.

