Explanation for tiny reads (overlapped, buffered) outperforming large contiguous reads?

Problem Description

(Apologies for the somewhat lengthy intro)

During development of an application which prefaults an entire large file (>400MB) into the buffer cache for speeding up the actual run later, I tested whether reading 4MB at a time still had any noticeable benefits over reading only 1MB chunks at a time. Surprisingly, the smaller requests actually turned out to be faster. This seemed counter-intuitive, so I ran a more extensive test.

The buffer cache was purged before running the tests (just for laughs, I did one run with the file in the buffers, too. The buffer cache delivers upwards of 2GB/s regardless of request size, though with a surprising +/- 30% random variance).
All reads used overlapped ReadFile with the same target buffer (the handle was opened with FILE_FLAG_OVERLAPPED and without FILE_FLAG_NO_BUFFERING). The harddisk used is somewhat elderly but fully functional, NTFS has a cluster size of 8kB. The disk was defragmented after an initial run (6 fragments vs. unfragmented, zero difference). For better figures, I used a larger file too, below numbers are for reading 1GB.
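
For reference, a minimal sketch of the kind of test loop described above -- the file name, chunk size, and error handling are placeholders, not the exact harness used. All requests are submitted up front into the same target buffer (only the cache warming matters, not the data), and the completions are collected separately, matching the submit-time vs. completion-time split discussed below.

#include <windows.h>
#include <vector>

int main()
{
    const DWORD chunkSize = 1 * 1024 * 1024;              // e.g. 1MB requests

    HANDLE file = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,  // buffered + overlapped
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size = {};
    GetFileSizeEx(file, &size);
    const size_t count = static_cast<size_t>((size.QuadPart + chunkSize - 1) / chunkSize);

    std::vector<char>       buffer(chunkSize);            // same target buffer for every request
    std::vector<OVERLAPPED> ov(count);

    // Submit phase ("submit time" in the measurements).
    for (size_t i = 0; i < count; ++i)
    {
        LONGLONG offset  = static_cast<LONGLONG>(i) * chunkSize;
        ov[i]            = OVERLAPPED{};
        ov[i].Offset     = static_cast<DWORD>(offset);
        ov[i].OffsetHigh = static_cast<DWORD>(offset >> 32);
        ov[i].hEvent     = CreateEventA(nullptr, TRUE, FALSE, nullptr);

        if (!ReadFile(file, buffer.data(), chunkSize, nullptr, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;                                      // genuine failure
    }

    // Completion phase ("completion time" in the measurements).
    for (size_t i = 0; i < count; ++i)
    {
        DWORD bytes = 0;
        GetOverlappedResult(file, &ov[i], &bytes, TRUE);   // block until this request is done
        CloseHandle(ov[i].hEvent);
    }

    CloseHandle(file);
    return 0;
}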

The results are quite surprising:

4MB x 256    : 5ms per request,    completion 25.8s @ ~40 MB/s
1MB x 1024   : 11.7ms per request, completion 23.3s @ ~43 MB/s
32kB x 32768 : 12.6ms per request, completion 15.5s @ ~66 MB/s
16kB x 65536 : 12.8ms per request, completion 13.5s @ ~75 MB/s

So, this suggests that submitting tens of thousands of requests, each only two clusters in length, is actually better than submitting a few hundred large, contiguous reads. The submit time (time before ReadFile returns) does go up substantially as the number of requests goes up, but asynchronous completion time nearly halves.
Kernel CPU time is around 5-6% in every case (on a quadcore, so one should really say 20-30%) while the asynchronous reads are completing, which is a surprising amount of CPU -- apparently the OS does a non-negligible amount of busy waiting, too. 30% CPU for 25 seconds at 2.6 GHz, that's quite a few cycles for doing "nothing".

Any idea how this can be explained? Maybe someone here has a deeper insight of the inner workings of Windows overlapped IO? Or, is there something substantially wrong with the idea that you can use ReadFile for reading a megabyte of data?

I can see how an IO scheduler would be able to optimize multiple requests by minimizing seeks, especially when requests are random access (which they aren't!). I can also see how a harddisk would be able to perform a similar optimization given a few requests in the NCQ.
However, we're talking about ridiculous numbers of ridiculously small requests -- which nevertheless outperform what appears to be sensible by a factor of 2.

Sidenote: The clear winner is memory mapping. I'm almost inclined to add "unsurprisingly" because I am a big fan of memory mapping, but in this case, it actually does surprise me, as the "requests" are even smaller and the OS should be even less able to predict and schedule the IO. I didn't test memory mapping at first because it seemed counter-intuitive that it might be able to compete even remotely. So much for your intuition, heh.

Mapping/unmapping a view repeatedly at different offsets takes practically zero time. Using a 16MB view and faulting every page with a simple for() loop reading a single byte per page completes in 9.2 secs @ ~111 MB/s. CPU usage is under 3% (one core) at all times. Same computer, same disk, same everything.
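
A minimal sketch of that memory-mapped prefault, assuming a placeholder file name and the 16MB view size mentioned above (error handling trimmed):

#include <windows.h>

int main()
{
    const SIZE_T viewSize = 16 * 1024 * 1024;             // 16MB sliding view
    const SIZE_T pageSize = 4096;

    HANDLE file = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size = {};
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) return 1;

    volatile char sink = 0;                               // keeps the touches from being optimized away
    for (LONGLONG offset = 0; offset < size.QuadPart; offset += viewSize)
    {
        LONGLONG remaining = size.QuadPart - offset;
        SIZE_T thisView = static_cast<SIZE_T>(
            remaining < static_cast<LONGLONG>(viewSize) ? remaining : viewSize);

        const char* view = static_cast<const char*>(
            MapViewOfFile(mapping, FILE_MAP_READ,
                          static_cast<DWORD>(offset >> 32),
                          static_cast<DWORD>(offset), thisView));
        if (!view) return 1;

        // Touch one byte per page; each touch faults the page (and, per the
        // observation below, apparently a few of its neighbours) into the cache.
        for (SIZE_T p = 0; p < thisView; p += pageSize)
            sink ^= view[p];

        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}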

It also appears that Windows loads 8 pages into the buffer cache at a time, although only one page is actually created. Faulting every 8th page runs at the same speed and loads the same amount of data from disk, but shows lower "physical memory" and "system cache" metrics and only 1/8 of the page faults. Subsequent reads prove that the pages are nevertheless definitively in the buffer cache (no delay, no disk activity).

(Possibly very, very related to: Memory-mapped file is faster on huge sequential reads?)

To make it a bit more illustrative:

Update:

Using FILE_FLAG_SEQUENTIAL_SCAN seems to somewhat "balance" reads of 128k, improving performance by 100%. On the other hand, it severely impacts reads of 512k and 256k (you have to wonder why?) and has no real effect on anything else. The MB/s graph of the smaller block sizes arguably seems a little more "even", but there is no difference in runtime.

I may have found an explanation for smaller block sizes performing better, too. As you know, asynchronous requests may run synchronously if the OS can serve the request immediately, i.e. from the buffers (and for a variety of version-specific technical limitations).

When accounting for actual asynchronous vs. "immediate" asynchronous reads, one notices that upwards of 256k, Windows runs every asynchronous request asynchronously. The smaller the blocksize, the more requests are being served "immediately", even when they are not available immediately (i.e. ReadFile simply runs synchronously). I cannot make out a clear pattern (such as "the first 100 requests" or "more than 1000 requests"), but there seems to be an inverse correlation between request size and synchronicity. At a blocksize of 8k, every asynchronous request is served synchronously.
Buffered synchronous transfers are, for some reason, twice as fast as asynchronous transfers (no idea why), hence the smaller the request sizes, the faster the overall transfer, because more transfers are done synchronously.
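
For what it's worth, one way to classify a request as "immediate" (synchronous) versus truly asynchronous at submit time is the ReadFile return value -- a sketch, not necessarily the exact instrumentation used:

#include <windows.h>

enum class Submit { Synchronous, Asynchronous, Failed };

// Classify a single overlapped read at submit time. The OVERLAPPED and the
// buffer are assumed to be managed by the caller.
Submit SubmitRead(HANDLE file, void* buffer, DWORD bytes, OVERLAPPED* ov)
{
    if (ReadFile(file, buffer, bytes, nullptr, ov))
        return Submit::Synchronous;                // served on the submitting thread

    if (GetLastError() == ERROR_IO_PENDING)
        return Submit::Asynchronous;               // queued; completion arrives later

    return Submit::Failed;                         // genuine error
}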

For memory mapped prefaulting, FILE_FLAG_SEQUENTIAL_SCAN causes a slightly different shape of the performance graph (there is a "notch" which is moved a bit backwards), but the total time taken is exactly identical (again, this is surprising, but I can't help it).

Update 2:

Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request testcases somewhat higher and more "spiky" with maximums in the 90s of GB/s, but with harsh minimums too; the overall runtime for 1GB is within +/- 0.5s of the buffered run (the requests with smaller buffer sizes complete significantly faster, however, that is because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned). Measured CPU usage is zero in all unbuffered cases, which is unsurprising, since any IO that happens runs via DMA.

Another very interesting observation with FILE_FLAG_NO_BUFFERING is that it significantly changes API behaviour. CancelIO does not work any more, at least not in a sense of cancelling IO. With unbuffered in-flight requests, CancelIO will simply block until all requests have finished. A lawyer would probably argue that the function cannot be held liable for neglecting its duty, because there are no more in-flight requests left when it returns, so in some way it has done what was asked -- but my understanding of "cancel" is somewhat different.
With buffered, overlapped IO, CancelIO will simply cut the rope, all in-flight operations terminate immediately, as one would expect.

Yet another funny thing is that the process is unkillable until all requests have finished or failed. This kind of makes sense if the OS is doing DMA into that address space, but it's a stunning "feature" nevertheless.

Answer

I'm not a filesystem expert but I think there are a couple of things going on here. First off, w.r.t. your comment about memory mapping being the winner: this isn't totally surprising, since the NT cache manager is based on memory mapping - by doing the memory mapping yourself, you're duplicating the cache manager's behavior without the additional memory copies.

When you read sequentially from the file, the cache manager attempts to pre-fetch the data for you - so it's likely that you are seeing the effect of readahead in the cache manager. At some point the cache manager stops prefetching reads (or rather at some point the prefetched data isn't sufficient to satisfy your reads and so the cache manager has to stall). That may account for the slowdown on larger I/Os that you're seeing.

Have you tried adding FILE_FLAG_SEQUENTIAL_SCAN to your CreateFile flags? That instructs the prefetcher to be even more aggressive.
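
In code, the flag simply goes into the existing CreateFile call -- a sketch with a placeholder function name and path:

#include <windows.h>

// Same buffered, overlapped open as in the question, plus
// FILE_FLAG_SEQUENTIAL_SCAN as a read-ahead hint to the cache manager.
HANDLE OpenForSequentialPrefault(const char* path)
{
    return CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                       OPEN_EXISTING,
                       FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN,
                       nullptr);
}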

This may be counter-intuitive, but traditionally the fastest way to read data off the disk is to use asynchronous I/O and FILE_FLAG_NO_BUFFERING. When you do that, the I/O goes directly from the disk driver into your I/O buffers with nothing to get in the way (assuming that the segments of the file are contiguous - if they're not, the filesystem will have to issue several disk reads to satisfy the application read request). Of course it also means that you lose the built-in prefetch logic and have to roll your own. But with FILE_FLAG_NO_BUFFERING you have complete control of your I/O pipeline.
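
A minimal sketch of such an unbuffered, overlapped read (file name and chunk size are placeholders). With FILE_FLAG_NO_BUFFERING the buffer address, the file offset, and the transfer size must all be multiples of the volume sector size, which is why the buffer comes from VirtualAlloc and the chunk size is kept sector-aligned:

#include <windows.h>

int main()
{
    const DWORD chunkSize = 1 * 1024 * 1024;               // multiple of the sector size

    HANDLE file = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // VirtualAlloc returns page-aligned memory, which satisfies the sector
    // alignment requirement on common volumes.
    void* buffer = VirtualAlloc(nullptr, chunkSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    OVERLAPPED ov = {};
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);

    DWORD bytes = 0;
    if (!ReadFile(file, buffer, chunkSize, nullptr, &ov) &&
        GetLastError() == ERROR_IO_PENDING)
    {
        GetOverlappedResult(file, &ov, &bytes, TRUE);       // data arrives straight in 'buffer'
    }

    CloseHandle(ov.hEvent);
    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(file);
    return 0;
}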

One other thing to remember: When you're doing asynchronous I/O, it's important to ensure that you always have an I/O request outstanding - otherwise you lose potential time between when the last I/O completes and the next I/O is started.
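
One common way to do that is to keep a small ring of in-flight overlapped reads and refill a slot the moment it completes, so the disk never sits idle between requests -- a sketch with an assumed queue depth, chunk size, and file name:

#include <windows.h>
#include <vector>

int main()
{
    const DWORD chunkSize = 1 * 1024 * 1024;
    const int   depth     = 4;                             // requests kept in flight

    HANDLE file = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size = {};
    GetFileSizeEx(file, &size);

    std::vector<std::vector<char>> buffers(depth, std::vector<char>(chunkSize));
    std::vector<OVERLAPPED>        ov(depth);
    std::vector<HANDLE>            events(depth);
    for (int i = 0; i < depth; ++i)
        events[i] = CreateEventA(nullptr, TRUE, FALSE, nullptr);

    LONGLONG nextOffset = 0;

    // Submit (or re-submit) the read for one slot; returns false when the
    // file is exhausted or the submit fails outright.
    auto submit = [&](int slot) -> bool {
        if (nextOffset >= size.QuadPart) return false;
        ov[slot]            = OVERLAPPED{};
        ov[slot].hEvent     = events[slot];
        ov[slot].Offset     = static_cast<DWORD>(nextOffset);
        ov[slot].OffsetHigh = static_cast<DWORD>(nextOffset >> 32);
        nextOffset += chunkSize;
        return ReadFile(file, buffers[slot].data(), chunkSize, nullptr, &ov[slot]) ||
               GetLastError() == ERROR_IO_PENDING;
    };

    int inFlight = 0;
    for (int i = 0; i < depth; ++i)                        // prime the pipeline
        if (submit(i)) ++inFlight;

    while (inFlight > 0)
    {
        DWORD which = WaitForMultipleObjects(depth, events.data(), FALSE, INFINITE) - WAIT_OBJECT_0;
        DWORD bytes = 0;
        GetOverlappedResult(file, &ov[which], &bytes, FALSE);
        ResetEvent(events[which]);                         // so an exhausted slot is not picked again
        if (!submit(static_cast<int>(which)))              // refill immediately: no idle gap
            --inFlight;
    }

    for (HANDLE e : events) CloseHandle(e);
    CloseHandle(file);
    return 0;
}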
