Determining Appropriate Buffer Size

Question

I am using ByteBuffer.allocateDirect() to allocate buffer memory for reading a file, and then hashing that file's bytes to get a file hash (SHA). The input files vary greatly in size, anywhere from a few KB to several GB.

I have read several threads and pages (including some on SO) about selecting a buffer size. Some advise matching the block size that the native file system uses, to minimize the chance of a read operation for a partial block. For example, with a 4100-byte buffer and the NTFS default of 4096 bytes, the extra 4 bytes would require a separate read operation, which is extremely wasteful.

So the advice is to stick with powers of 2: 1024, 2048, 4096, 8192, and so on. I have seen some people recommend 32 KB buffers, and others recommend making the buffer the size of the input file (probably fine for small files, but what about large files?).

How important is it to stick to native-block-sized buffers? On modern hardware (assuming a modern SATA drive or better with at least 8 MB of on-drive cache, plus other modern OS "magic" to optimize I/O), how critical is the buffer size, and how should I best determine what size to use? Should I set it statically or determine it dynamically? Thanks for any insight.

Recommended answer

To answer your direct question: (1) filesystems tend to use powers of 2, so you want to do the same; (2) the larger your working buffer, the less effect any mis-sizing will have.

As you say, if you allocate 4100 bytes and the actual block size is 4096, you'll need two reads to fill the buffer. If, instead, you have a 1,000,000-byte buffer, then being one block high or low doesn't matter (because it takes 245 4096-byte blocks to fill that buffer). Moreover, the larger buffer gives the OS a better chance to order the reads.
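The arithmetic behind that comparison can be sketched in a few lines of Java. This is a simplified model that assumes every buffer fill is served by whole 4096-byte blocks and ignores OS read-ahead and caching; the class and method names (BlockMath, blocksNeeded, overshootFraction) are made up for illustration.

```java
final class BlockMath {
    // Whole blocks that must be read to cover a buffer of bufferSize bytes.
    static long blocksNeeded(long bufferSize, long blockSize) {
        return (bufferSize + blockSize - 1) / blockSize; // ceiling division
    }

    // Share of the fetched bytes that do not fit in the buffer (0.0 to 1.0).
    static double overshootFraction(long bufferSize, long blockSize) {
        long fetched = blocksNeeded(bufferSize, blockSize) * blockSize;
        return (double) (fetched - bufferSize) / fetched;
    }

    public static void main(String[] args) {
        long block = 4096; // assumed block size (the NTFS default cited in the question)
        for (long buffer : new long[] {4096, 4100, 1_000_000}) {
            System.out.printf("%d-byte buffer: %d block(s), %.2f%% overshoot%n",
                    buffer,
                    blocksNeeded(buffer, block),
                    100 * overshootFraction(buffer, block));
        }
    }
}
```

Under this model the 4100-byte buffer spends roughly half of what it fetches on bytes it cannot hold, while for the 1,000,000-byte buffer the misalignment is well under one percent.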

That said, I wouldn't use NIO for this. Instead, I'd use a simple BufferedInputStream, with maybe a 1k buffer for my read() calls.
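A minimal sketch of that approach might look like the following. It makes a few assumptions: the question only says "SHA", so SHA-256 is chosen here as one concrete variant; the BufferedInputStream keeps its default internal buffer size; and the FileHasher class with its helper names is invented for the example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileHasher {
    // Streams the file through a small array so memory use stays flat
    // regardless of file size.
    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256"); // assumed SHA variant
        byte[] chunk = new byte[1024]; // the "1k buffer" for read() calls
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int n;
            // read() returns the number of bytes actually read, or -1 at end of stream.
            while ((n = in.read(chunk)) != -1) {
                digest.update(chunk, 0, n); // hash only the bytes that were read
            }
        }
        return digest.digest();
    }

    // Lower-case hex rendering of the digest bytes.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Demo on a throwaway temp file; pass a real path in practice.
        Path tmp = Files.createTempFile("hash-demo", ".bin");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(toHex(sha256(tmp)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the digest is updated chunk by chunk, the same code handles a few KB or several GB without a buffer sized to the file.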

The main benefit of NIO is keeping data out of the Java heap. If you're reading and writing a file, for example, using an InputStream means that the OS reads the data into a JVM-managed buffer, the JVM copies that into an on-heap buffer, then copies it again to an off-heap buffer, and then the OS reads that off-heap buffer to write the actual disk blocks (typically adding its own buffers along the way). In this case, NIO eliminates those native-to-heap copies.

However, to compute a hash, you need to have the data in the Java heap, and the Mac SPI will move it there. So you don't get the benefit of NIO keeping the data off-heap, and IMO the "old IO" is easier to write.

Just don't forget that InputStream.read() is not guaranteed to read all the bytes you ask for. It returns the number of bytes actually read, which can be fewer than the length of the array you pass in.
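The short-read behavior can be illustrated with a small sketch. The Trickle stream below is a made-up test double that hands back at most 3 bytes per call, standing in for any source that returns less than was requested; readFully is a hypothetical helper that loops until the array is full or the stream ends.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ShortReads {
    // Hypothetical stream that never returns more than 3 bytes per read call.
    static final class Trickle extends FilterInputStream {
        Trickle(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    // Keeps reading until buf is full or the stream ends; returns bytes stored.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break; // end of stream before the buffer filled up
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];

        byte[] buf = new byte[8];
        int single = new Trickle(new ByteArrayInputStream(data)).read(buf);
        System.out.println("single read() returned " + single + " of " + buf.length);

        int looped = readFully(new Trickle(new ByteArrayInputStream(data)), buf);
        System.out.println("readFully returned " + looped + " of " + buf.length);
    }
}
```

On Java 9 and later, InputStream.readNBytes(byte[], int, int) performs the same loop. For hashing specifically, filling the array is not required: passing only the returned count to the digest update is enough.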
