How do you determine the ideal buffer size when using FileInputStream?


Question

I have a method that creates a MessageDigest (a hash) from a file, and I need to do this for a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?

Most everyone is familiar with the basic code (which I'll repeat here just in case):

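import java.io.FileInputStream;
import java.security.MessageDigest;

MessageDigest md = MessageDigest.getInstance("SHA");
byte[] buffer = new byte[4 * 1024]; // what should this value be?
// try-with-resources closes the stream even if read() throws
try (FileInputStream in = new FileInputStream("myfile.bmp")) {
    int read;
    // read() returns -1 at end of stream
    while ((read = in.read(buffer)) != -1) {
        md.update(buffer, 0, read);
    }
}
byte[] hash = md.digest();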

What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure it's OS, file system, and HDD dependent, and there may be other hardware/software in the mix.

(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)

I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)

The code above omits exception handling (try..catch and the like) to keep the post short.

Recommended answer

Optimum buffer size is related to a number of things: file system block size, CPU cache size, and cache latency.

Most file systems are configured to use block sizes of 4096 or 8192 bytes. In theory, if you configure your buffer size so that you read a few bytes more than a disk block, the operations with the file system can be extremely inefficient (e.g. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
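To make the 4100-byte example concrete, here is a small sketch that counts how many blocks a single read overlaps. The class and method names (BlockSpan, blocksTouched) are made up for illustration, and the 4096-byte block size is an assumption; the OS page cache and read-ahead will hide much of this in practice.

```java
final class BlockSpan {
    /** Number of file system blocks overlapped by a read of `length` bytes starting at `offset`. */
    static long blocksTouched(long offset, long length, long blockSize) {
        if (length <= 0) {
            return 0;
        }
        long firstBlock = offset / blockSize;
        long lastBlock = (offset + length - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    public static void main(String[] args) {
        long blockSize = 4096; // assumed block size
        for (int bufferSize : new int[] {4096, 4100}) {
            for (int i = 0; i < 4; i++) {
                // Sequential reads: the i-th read starts where the previous one ended.
                long offset = (long) i * bufferSize;
                System.out.printf("buffer=%d read#%d touches %d block(s)%n",
                        bufferSize, i, blocksTouched(offset, bufferSize, blockSize));
            }
        }
    }
}
```

With an aligned 4096-byte buffer every read stays inside one block, while a 4100-byte buffer makes every read straddle at least two.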

This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block, so no read is wasted.
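If you would rather derive the buffer size than hard-code it, something along these lines might work. This is only a sketch: it assumes Java 10 or later (for FileStore.getBlockSize()), the names BufferSizing, blockSizeOf and roundUpToBlocks are invented for the example, and the 4096 fallback is an assumption for file systems that do not report a block size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BufferSizing {
    static final int DEFAULT_BLOCK_SIZE = 4096; // assumed fallback, not a measured value

    /** Block size reported for the file store holding `path`, or the fallback if unavailable. */
    static long blockSizeOf(Path path) {
        try {
            long size = Files.getFileStore(path).getBlockSize(); // Java 10+
            return size > 0 ? size : DEFAULT_BLOCK_SIZE;
        } catch (IOException | UnsupportedOperationException e) {
            return DEFAULT_BLOCK_SIZE;
        }
    }

    /** Rounds `requested` up to the next whole multiple of `blockSize` (at least one block). */
    static int roundUpToBlocks(int requested, long blockSize) {
        long blocks = Math.max(1, (requested + blockSize - 1) / blockSize);
        return Math.toIntExact(blocks * blockSize);
    }

    public static void main(String[] args) {
        long blockSize = blockSizeOf(Paths.get("."));
        System.out.println("block size:  " + blockSize);
        System.out.println("buffer size: " + roundUpToBlocks(8192, blockSize));
    }
}
```

Rounding up to a whole number of blocks keeps every read block-aligned when you read sequentially from the start of the file.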

Now, this is offset quite a bit in a typical streaming scenario, because the block that is read from disk is still going to be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk -> RAM latency. In terms of order of magnitude, disk -> RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.

So, I suspect that if you ran a test with different buffer sizes (I haven't done this myself), you would probably find that buffer size has a big impact up to the size of the file system block. Above that, I suspect that things would level out pretty quickly.
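A rough way to run that test yourself is sketched below. It is not a rigorous benchmark: the class name, the 16 MB file size, and the list of buffer sizes are all arbitrary choices, and because the test file has just been written it will almost certainly be served from the OS cache, so this measures the RAM path rather than the disk. For real numbers, run it against your actual files on the target machine.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

final class BufferSizeBenchmark {
    /** Hashes the file with SHA-1, reading it through a buffer of the given size. */
    static byte[] hashFile(Path file, int bufferSize) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[bufferSize];
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    /** Wall-clock time of one hashing pass, in nanoseconds. */
    static long timeNanos(Path file, int bufferSize) throws IOException, NoSuchAlgorithmException {
        long start = System.nanoTime();
        hashFile(file, bufferSize);
        return System.nanoTime() - start;
    }

    /** Creates a temporary file filled with pseudo-random bytes. */
    static Path createTestFile(int sizeBytes) throws IOException {
        Path file = Files.createTempFile("buffer-bench", ".bin");
        byte[] data = new byte[sizeBytes];
        new Random(42).nextBytes(data);
        Files.write(file, data);
        return file;
    }

    public static void main(String[] args) throws Exception {
        Path file = createTestFile(16 * 1024 * 1024);
        try {
            for (int size : new int[] {512, 1024, 4096, 8192, 65536, 1 << 20}) {
                hashFile(file, size); // warm-up pass so JIT compilation is not timed
                long nanos = timeNanos(file, size);
                System.out.printf("%8d bytes: %.1f ms%n", size, nanos / 1e6);
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```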

There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind-bogglingly complex, and it changes with every CPU type).

This leads to the 'real world' answer: if your app is like 99% of the apps out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to allow your users to test and optimize (or come up with some self-optimizing system).
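As a sketch of what that might look like in code: the hypothetical FileHasher class below defaults to 8192 and lets BufferedInputStream do the large reads, while still exposing the buffer size as a knob for the 1% case. The class name, constructor, and the 1024-byte chunk are my own choices for illustration, not anything prescribed by the JDK.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileHasher {
    static final int DEFAULT_BUFFER_SIZE = 8192;

    private final int bufferSize;

    /** Uses the "set it to 8192 and move on" default. */
    FileHasher() {
        this(DEFAULT_BUFFER_SIZE);
    }

    /** The knob: lets callers experiment with other buffer sizes. */
    FileHasher(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive: " + bufferSize);
        }
        this.bufferSize = bufferSize;
    }

    byte[] sha1(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), bufferSize)) {
            // Small on purpose: BufferedInputStream issues the large reads
            // against the file and this loop just drains its internal buffer.
            byte[] chunk = new byte[1024];
            int read;
            while ((read = in.read(chunk)) != -1) {
                md.update(chunk, 0, read);
            }
        }
        return md.digest();
    }
}
```

Most callers would just write new FileHasher().sha1(path); only code that has actually measured a benefit needs to pass a different size.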
