Java - Reading a big file (few GB)

Question

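import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        byte[] content = null;
        try {
            // Tries to load the whole file into a single byte[]
            content = Files.readAllBytes(Paths.get("/path/to/file.ext"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(content);
    }
}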

This is the output:

Exception in thread "main" java.lang.OutOfMemoryError: Required array size too large
    at java.nio.file.Files.readAllBytes(Unknown Source)
    at Main.main(Main.java:13)

Is there a way to read the file without an exception (streams etc.)? The file is smaller than the allowed heap, so it should be possible to hold all the data in the program at once.

Recommended answer

The issue is that the array required to hold all that data is larger than MAX_BUFFER_SIZE, which is defined in java.nio.file.Files as Integer.MAX_VALUE - 8:

public static byte[] readAllBytes(Path path) throws IOException {
    try (SeekableByteChannel sbc = Files.newByteChannel(path);
         InputStream in = Channels.newInputStream(sbc)) {
        long size = sbc.size();
        if (size > (long)MAX_BUFFER_SIZE)
            throw new OutOfMemoryError("Required array size too large");

        return read(in, (int)size);
    }
}

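This check is necessary because Java arrays are indexed by int, so Integer.MAX_VALUE is the upper bound on array length. The JDK subtracts 8 because some VMs reserve header words in an array. This is the biggest array you can get.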
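A minimal sketch of checking the file size up front, so a program can pick another strategy instead of hitting the error. The class name SizeCheck and both helper methods are hypothetical, and the constant is a local copy of the limit, since MAX_BUFFER_SIZE is private to Files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SizeCheck {
    // Local copy of the limit quoted above; an assumption, not a public JDK constant.
    static final int MAX_BUFFER_SIZE = Integer.MAX_VALUE - 8;

    // True if a file of this many bytes can be held in a single byte[].
    static boolean fitsInOneArray(long sizeInBytes) {
        return sizeInBytes <= (long) MAX_BUFFER_SIZE;
    }

    // Reads the file in one go only when it fits; otherwise fails with an IOException
    // that the caller can catch, rather than an OutOfMemoryError.
    static byte[] readIfSmallEnough(Path path) throws IOException {
        long size = Files.size(path);
        if (!fitsInOneArray(size)) {
            throw new IOException("File is " + size + " bytes; too large for one array");
        }
        return Files.readAllBytes(path);
    }
}
```

A file of a few GB fails the check, which is the signal to fall back to one of the options below.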

You have three options:

Stream the file

That is, open the file, read a chunk, process it, read another chunk, process it, and so on until you have gone through the whole thing.

Java provides many classes for this, such as InputStream, Reader and Scanner. They are covered early in most introductory Java courses and books; study one of them.

Example: https://stackoverflow.com/a/21706141/7512
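As a rough illustration of the read-a-chunk, process-it loop, the sketch below computes a CRC32 checksum of a file while holding only one 64 KiB buffer in memory. The class name StreamingChecksum, the method name and the buffer size are illustrative choices, and the checksum stands in for whatever processing you actually need.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

class StreamingChecksum {
    // Reads the file one chunk at a time and folds each chunk into a CRC32.
    static long checksum(Path path) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024]; // chunk size is an arbitrary choice
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            // read() returns the number of bytes placed in the buffer, or -1 at end of file
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n); // process only the bytes actually read
            }
        }
        return crc.getValue();
    }
}
```

Memory use stays constant regardless of file size, because each chunk is discarded once it has been processed.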

How useful this is depends on whether you can do something worthwhile with an early part of the file without knowing what comes later. Much of the time you can. At other times you have to make more than one pass through the file.

File formats are often designed so that processing can be done in a single pass, and it is a good idea to design your own file formats with this in mind.

I note that your file is a .trec file, which is a screen-captured video. Video and audio formats are especially likely to be designed for streaming, which is why you can watch the start of a YouTube video before the end has downloaded.

Memory mapping

If you really need to jump around the content of the file to process it, you can open it as a memory-mapped file.

Look at the documentation for RandomAccessFile. It gives you an object with a seek() method, so you can read from arbitrary points in the file's data.
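A minimal sketch of both forms of random access, under a few assumptions. readWithSeek uses RandomAccessFile.seek() as described above. readMapped uses FileChannel.map(), which is the standard-library call that actually memory-maps a file; a single mapping is limited to Integer.MAX_VALUE bytes, so a multi-GB file has to be mapped one window at a time. The class and method names are hypothetical, and the requested window is assumed to lie within the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class RandomAccessRead {
    // Jumps to `offset` with seek() and reads `length` bytes from there.
    static byte[] readWithSeek(Path path, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            raf.seek(offset);
            byte[] out = new byte[length];
            raf.readFully(out); // throws EOFException if fewer than `length` bytes remain
            return out;
        }
    }

    // Maps a window of the file into memory and copies it out.
    // The offset is a long, but one mapping can cover at most Integer.MAX_VALUE bytes.
    static byte[] readMapped(Path path, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] out = new byte[length];
            window.get(out); // copies `length` bytes from the mapped region
            return out;
        }
    }
}
```

In real use you would usually keep the MappedByteBuffer around and read from it directly rather than copying; the copy here just keeps the example small.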

Read into multiple arrays

I include this only for completeness; it's ugly to slurp the whole file into heap memory. But if you really wanted to, you could store the bytes in a number of arrays, perhaps a List<byte[]>. Roughly, in Java:

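  // Needs java.io.* and java.util.*. MAX_BUFFER_SIZE is private to Files,
  // so define your own constant, e.g. Integer.MAX_VALUE - 8.
  List<byte[]> filecontents = new ArrayList<byte[]>();
  try (InputStream is = new FileInputStream(...)) {
      byte[] buffer = new byte[MAX_BUFFER_SIZE];
      int bytesGot = is.read(buffer);
      while (bytesGot != -1) {
          // read() may return fewer bytes than requested, so copy only what arrived
          filecontents.add(Arrays.copyOf(buffer, bytesGot));
          bytesGot = is.read(buffer);
      }
  }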

This allows you up to MAX_BUFFER_SIZE * Integer.MAX_VALUE bytes. Accessing the contents is slightly more fiddly than using a simple array, but that implementation detail can be hidden inside a class.
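One way to hide that detail is sketched below: a small class that stores bytes in fixed-size chunks and exposes a long-indexed get(), as if the chunks were one huge array. ChunkedBytes and its methods are hypothetical names, not part of any library. Unlike the loop above, it packs incoming bytes into full chunks itself, so that index arithmetic works even when read() returns short counts.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedBytes {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long length = 0;

    ChunkedBytes(int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
    }

    // Appends the first n bytes of src, starting a new chunk whenever the current one is full.
    void append(byte[] src, int n) {
        int copied = 0;
        while (copied < n) {
            int posInChunk = (int) (length % chunkSize);
            if (posInChunk == 0) {
                chunks.add(new byte[chunkSize]); // the last chunk may end up partly unused
            }
            byte[] current = chunks.get(chunks.size() - 1);
            int toCopy = Math.min(n - copied, chunkSize - posInChunk);
            System.arraycopy(src, copied, current, posInChunk, toCopy);
            copied += toCopy;
            length += toCopy;
        }
    }

    // Total number of bytes stored; may exceed Integer.MAX_VALUE.
    long length() {
        return length;
    }

    // Looks up a byte by a long index: the chunk number is index / chunkSize,
    // and the position inside that chunk is index % chunkSize.
    byte get(long index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
        return chunks.get((int) (index / chunkSize))[(int) (index % chunkSize)];
    }
}
```

A reading loop would call append(buffer, bytesGot) for each successful read and then use get(position) wherever it would otherwise have indexed a plain array.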

You would, of course, need to configure Java to have a huge heap size (the -Xmx option); see "How to set the maximum memory usage for JVM?"

Don't do this.
