Why is dd with the 'direct' (O_DIRECT) flag so dramatically faster?

Problem description

I have a server with a RAID50 configuration of 24 drives (two groups of 12), and if I run:

dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct

I get:

2047868928 bytes (2.0 GB) copied, 0.805075 s, 2.5 GB/s

But if I run:

dd if=/dev/zero of=ddfile2 bs=1M count=1953

I get:

2047868928 bytes (2.0 GB) copied, 2.53489 s, 808 MB/s

I understand that O_DIRECT causes the page cache to be bypassed. But as I understand it bypassing the page cache basically means avoiding a memcpy. Testing on my desktop with the bandwidth tool I have a worst case sequential memory write bandwidth of 14GB/s, and I imagine on the newer much more expensive server the bandwidth must be even better. So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache? Is this atypical?

Recommended answer

In the oflag=direct case:

  • You give the kernel the go-ahead to write the data out immediately, rather than filling buffers and waiting for a threshold/timeout to be hit (which in turn means the data is less likely to be held up behind a sync of unrelated data).
  • You save the kernel work (no extra copy from userspace into the kernel, and most of the buffer cache management operations don't need to be performed).
  • In some cases, dirtying buffers faster than they can be flushed results in the program generating the dirty buffers being made to wait until pressure against arbitrary limits is relieved (see SUSE's "Low write performance on SLES 11/12 servers with large RAM"); a rough way to watch this pressure from the shell is sketched just after this list.
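
As a rough illustration of that last point, you can watch how much dirty data the page cache is holding while a buffered write runs. This is only a sketch (it reuses the ddfile2 name from the question and simply samples /proc/meminfo once a second):

dd if=/dev/zero of=ddfile2 bs=1M count=1953 &     # buffered write in the background (the question's command, minus oflag=direct)
while kill -0 $! 2>/dev/null; do                  # loop while that dd is still running
    grep -E '^(Dirty|Writeback):' /proc/meminfo   # show how much data is waiting to be flushed
    sleep 1
done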

More generally, that giant block size (1 MByte) is likely bigger than the RAID's block size, so the I/O will be split up within the kernel and those smaller pieces submitted in parallel. The block is also big enough that the coalescing you get from buffered writeback of tiny I/Os won't be worth much (the exact point at which the kernel starts splitting I/Os depends on a number of factors). Further, while RAID stripe sizes can be larger than 1 MByte, the kernel isn't always aware of this for hardware RAID. In the case of software RAID the kernel can sometimes optimize for stripe size - e.g. the kernel I'm on knows the md0 device has a 4 MByte stripe size and expresses a hint that it prefers I/O of that size via /sys/block/md0/queue/optimal_io_size.
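
If you want to see what I/O size hints the kernel has for a given block device, the queue attributes mentioned above can be read straight out of sysfs (md0 here is just the device from the example; substitute your own):

$ cat /sys/block/md0/queue/optimal_io_size    # preferred I/O size in bytes (0 = no hint)
$ cat /sys/block/md0/queue/minimum_io_size    # smallest I/O size the device prefers
$ cat /sys/block/md0/queue/max_sectors_kb     # largest single I/O the kernel will submit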

Given all the above, IF you were maxing out a single CPU during the original buffered copy AND your workload doesn't benefit much from caching/coalescing BUT the disk could handle more throughput THEN doing the O_DIRECT copy should go faster as there's more CPU time available for userspace/servicing disk I/Os due to the reduction in kernel overhead.
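
One quick way to check whether the buffered copy really is CPU-bound is to wrap the two dd invocations from the question in time and compare the sys time to the elapsed time (just a sketch, using the same commands as above):

$ time dd if=/dev/zero of=ddfile2 bs=1M count=1953               # buffered: sys time close to real time suggests CPU-bound
$ time dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct  # direct: expect far less sys time for the same amount of data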

So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache?

It's not just an extra memcpy per I/O that is involved - think about all the extra cache machinery that must be maintained. There is a nice explanation about how copying a buffer to the kernel isn't instantaneous and how page pressure can slow things down in an answer to the Linux async (io_submit) write v/s normal (buffered) write question. However, unless your program can generate data fast enough AND the CPU is so overloaded it can't feed the disk quickly enough then it usually doesn't show up or matter.
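
For reference, the "page pressure" described in that answer is governed by the kernel's writeback tunables. Reading them is harmless and shows the limits your buffered writes are running up against (the paths are standard Linux sysctls; values are whatever your system is configured with):

$ cat /proc/sys/vm/dirty_background_ratio    # % of RAM dirty before background writeback kicks in
$ cat /proc/sys/vm/dirty_ratio               # % of RAM dirty before writers are throttled
$ cat /proc/sys/vm/dirty_expire_centisecs    # age at which dirty data must be written out
$ cat /proc/sys/vm/dirty_writeback_centisecs # how often the flusher threads wake up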

Is this atypical?

No, your result is quite typical with the sort of workload you were using. I'd imagine it would be a very different outcome if the blocksize were tiny (e.g. 512 bytes) though.
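
For example, a comparison at a tiny block size might be run like this (same fio options as the runs below, only with bs=512; note that direct I/O requires the block size to be a multiple of the device's logical block size, and the numbers would have to be measured rather than assumed):

$ fio --end_fsync=1 --bs=512 --size=1G --rw=write --filename=zeroes --name=buffered_512b
$ fio --end_fsync=1 --bs=512 --size=1G --rw=write --filename=zeroes --direct=1 --name=direct_512b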

Let's compare some of fio's output to help us understand this:

$ fio --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M_no_fsync
buffered_1M_no_fsync: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2511MiB/s][r=0,w=2510 IOPS][eta 00m:00s]
buffered_1M_no_fsync: (groupid=0, jobs=1): err= 0: pid=25408: Sun Aug 25 09:10:31 2019
  write: IOPS=2100, BW=2100MiB/s (2202MB/s)(20.0GiB/9752msec)
[...]
  cpu          : usr=2.08%, sys=97.72%, ctx=114, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%

So using buffering we wrote at about 2.1 GBytes/s but used up a whole CPU to do so. However, the block device (md0) says it barely saw any I/O (ios=0/3 - only three write I/Os) which likely means most of the I/O was cached in RAM! As this particular machine could easily buffer 20 GBytes in RAM we shall do another run with end_fsync=1 to force any data that may only have been in the kernel's RAM cache at the end of the run to be pushed to disk thus ensuring we record the time it took for all the data to actually reach non-volatile storage:

$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M
buffered_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]      
buffered_1M: (groupid=0, jobs=1): err= 0: pid=41884: Sun Aug 25 09:13:01 2019
  write: IOPS=1928, BW=1929MiB/s (2023MB/s)(20.0GiB/10617msec)
[...]
  cpu          : usr=1.77%, sys=97.32%, ctx=132, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/40967, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2561, aggrmerge=0/2559, aggrticks=0/132223, aggrin_queue=127862, aggrutil=21.36%

OK, now the speed has dropped to about 1.9 GBytes/s and we still use all of a CPU, but the disks in the RAID device claim they had capacity to go faster (aggrutil=21.36%). Next up, direct I/O:

$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_1M 
direct_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=3242MiB/s][r=0,w=3242 IOPS][eta 00m:00s]
direct_1M: (groupid=0, jobs=1): err= 0: pid=75226: Sun Aug 25 09:16:40 2019
  write: IOPS=2252, BW=2252MiB/s (2361MB/s)(20.0GiB/9094msec)
[...]
  cpu          : usr=8.71%, sys=38.14%, ctx=20621, majf=0, minf=83
[...]
Disk stats (read/write):
    md0: ios=0/40966, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5120, aggrmerge=0/0, aggrticks=0/1283, aggrin_queue=1, aggrutil=0.09%

Going direct we use just under 50% of a CPU to do 2.2 GBytes/s (but notice how I/Os weren't merged and how we did far more userspace/kernel context switches). If we were to push more I/O per syscall things change:

$ fio --bs=4M --size=20G --rw=write --filename=zeroes --name=buffered_4M_no_fsync
buffered_4M_no_fsync: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2390MiB/s][r=0,w=597 IOPS][eta 00m:00s]
buffered_4M_no_fsync: (groupid=0, jobs=1): err= 0: pid=8029: Sun Aug 25 09:19:39 2019
  write: IOPS=592, BW=2370MiB/s (2485MB/s)(20.0GiB/8641msec)
[...]
  cpu          : usr=3.83%, sys=96.19%, ctx=12, majf=0, minf=1048
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%

$ fio --end_fsync=1 --bs=4M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_4M
direct_4M: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=5193MiB/s][r=0,w=1298 IOPS][eta 00m:00s]
direct_4M: (groupid=0, jobs=1): err= 0: pid=92097: Sun Aug 25 09:22:39 2019
  write: IOPS=866, BW=3466MiB/s (3635MB/s)(20.0GiB/5908msec)
[...]
  cpu          : usr=10.02%, sys=44.03%, ctx=5233, majf=0, minf=12
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%

With a massive block size of 4 MBytes buffered I/O became bottlenecked at "just" 2.3 GBytes/s (even when we didn't force the cache to be flushed) due to the fact that there's no CPU left. Direct I/O used around 55% of a CPU and managed to reach 3.5 GBytes/s so it was roughly 50% faster than buffered I/O.

Summary: Your I/O pattern doesn't really benefit from buffering (I/Os are huge, data is not being reused, I/O is streaming sequential) so you're in an optimal scenario for O_DIRECT being faster. See these slides by the original author of Linux's O_DIRECT (longer PDF document that contains an embedded version of most of the slides) for the original motivation behind it.
