How do you determine the ideal buffer size when using FileInputStream?


Question


I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?

Most everyone is familiar with the basic code (which I'll repeat here just in case):

import java.io.FileInputStream;
import java.security.MessageDigest;

MessageDigest md = MessageDigest.getInstance("SHA");
FileInputStream ios = new FileInputStream("myfile.bmp");
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
while ((read = ios.read(buffer)) > 0)
    md.update(buffer, 0, read);
ios.close();
byte[] hash = md.digest();

What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure it's OS, file system, and HDD dependent, and there may be other hardware/software in the mix.

(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)

Edit: I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)

Edit: The code above omits things like try..catch to keep the post smaller.

Solution

Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.

Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations against the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
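The "4100-byte buffer straddles two blocks" point is just arithmetic; this hypothetical helper counts how many 4096-byte blocks a single read at a given file offset touches:

```java
public class BlockSpan {
    // How many blocks of size `blockSize` does a read of `bufSize`
    // bytes starting at `offset` touch?
    static int blocksTouched(long offset, int bufSize, int blockSize) {
        long first = offset / blockSize;                  // block holding the first byte
        long last = (offset + bufSize - 1) / blockSize;   // block holding the last byte
        return (int) (last - first + 1);
    }

    public static void main(String[] args) {
        // A 4100-byte buffer straddles two 4096-byte blocks on every read...
        System.out.println(blocksTouched(0, 4100, 4096));    // 2
        System.out.println(blocksTouched(4100, 4100, 4096)); // 2
        // ...while a 4096-byte buffer stays block-aligned.
        System.out.println(blocksTouched(0, 4096, 4096));    // 1
        System.out.println(blocksTouched(4096, 4096, 4096)); // 1
    }
}
```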

This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.
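If you want to size the buffer from the actual block size rather than guessing, Java 10+ exposes it via `FileStore.getBlockSize()`. A minimal sketch (the 4096 fallback is an assumption for stores that don't report a block size, not an API guarantee):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BlockSizeProbe {
    // Ask the file store backing `p` for its block size (Java 10+).
    // Falls back to a conventional 4096 if the store can't report one.
    static long blockSize(Path p) {
        try {
            return Files.getFileStore(p).getBlockSize();
        } catch (Exception e) { // IOException or UnsupportedOperationException
            return 4096;
        }
    }

    public static void main(String[] args) {
        System.out.println(blockSize(Paths.get("."))); // commonly 4096
    }
}
```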

Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.

So, I suspect that if you ran a test with different buffer sizes (I haven't done this myself), you will probably find that buffer size has a big impact up to the size of the file system block. Above that, I suspect things would level out pretty quickly.
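Such a test could be sketched as below. This is deliberately crude, not a rigorous benchmark: the OS file cache and JIT warm-up will skew single runs, and the file name and sizes are arbitrary choices for illustration:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;

public class BufferSizeBench {
    // Hash one file with the given buffer size; returns elapsed nanoseconds.
    static long timeDigest(File f, int bufSize) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[bufSize];
        long start = System.nanoTime();
        try (InputStream in = new FileInputStream(f)) {
            int n;
            while ((n = in.read(buf)) > 0)
                md.update(buf, 0, n);
        }
        md.digest();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("bench", ".bin");
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(new byte[8 * 1024 * 1024]); // 8 MiB test file
        }
        for (int size : new int[] {1024, 4096, 8192, 65536})
            System.out.printf("%6d-byte buffer: %d us%n", size, timeDigest(f, size) / 1000);
        f.delete();
    }
}
```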

There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind bogglingly complex, and it changes with every CPU type).

This leads to the 'real world' answer: if your app is like 99% of apps out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to let your users test and optimize (or come up with some self-optimizing system).
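The BufferedInputStream suggestion might look like the following sketch. It also uses `java.security.DigestInputStream`, which updates the digest as bytes stream through, so the read loop stays trivial (SHA-256 is used here instead of the question's "SHA" purely for illustration):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class FileHasher {
    // Hash a file, letting BufferedInputStream encapsulate buffering.
    static byte[] sha256(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(f), 8192), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) > 0) { /* digest is updated as we read */ }
        }
        return md.digest(); // 32 bytes for SHA-256
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha256(new File(args[0])).length);
    }
}
```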
