Java - Read a big file (few GB) without exception

Problem description

This question is very short. I have a file, Datei.trec (3.99 GB), and I read it with this code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        byte[] content = null;
        try {
            content = Files.readAllBytes(Paths.get("D:", "Videos","Captures","Datei.trec"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(content);
    }
}

This is the output:

Exception in thread "main" java.lang.OutOfMemoryError: Required array size too large
    at java.nio.file.Files.readAllBytes(Unknown Source)
    at Main.main(Main.java:13)

Is there a way to read the file without an exception (streams, etc.)? The file is smaller than the allowed heap, so it should be possible to hold all the data in the program at once.

Recommended answer

The issue is that the array required to hold all that data is larger than MAX_BUFFER_SIZE, which is defined in java.nio.file.Files as Integer.MAX_VALUE - 8:

public static byte[] readAllBytes(Path path) throws IOException {
    try (SeekableByteChannel sbc = Files.newByteChannel(path);
         InputStream in = Channels.newInputStream(sbc)) {
        long size = sbc.size();
        if (size > (long)MAX_BUFFER_SIZE)
            throw new OutOfMemoryError("Required array size too large");

        return read(in, (int)size);
    }
}

This check is necessary because arrays are indexed by int, so this is about the biggest array you can get, no matter how large the heap is.
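To make the limit concrete, here is a minimal sketch that checks whether a given size could fit in a single byte array. The class name SizeCheck and the constant MAX_ARRAY_BYTES are illustrative assumptions; the value mirrors the Integer.MAX_VALUE - 8 limit quoted above, which is a private detail of Files and not a public API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SizeCheck {
    // Mirrors the limit quoted above; a hypothetical name, not a public JDK constant.
    static final long MAX_ARRAY_BYTES = Integer.MAX_VALUE - 8;

    /** Returns true if this many bytes could fit in one byte[]. */
    static boolean fitsInOneArray(long sizeInBytes) {
        return sizeInBytes <= MAX_ARRAY_BYTES;
    }

    /** Same check, using the size of a file on disk. */
    static boolean fitsInOneArray(Path path) throws IOException {
        return fitsInOneArray(Files.size(path));
    }

    public static void main(String[] args) {
        long questionFile = 3_990_000_000L; // roughly the 3.99 GB file from the question
        System.out.println(MAX_ARRAY_BYTES);              // prints 2147483639
        System.out.println(fitsInOneArray(questionFile)); // prints false
    }
}
```

A check like this lets a program choose a chunked strategy up front instead of catching the OutOfMemoryError afterwards.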

You have three options:

Stream through the file

That is, open the file, read a chunk, process it, read another chunk, process it, again and again until you've gone through the whole thing.

Java provides lots of classes to do this: InputStream, Reader, Scanner, etc. They are discussed early in most introductory Java courses and books. Study one of these.

Example: https://stackoverflow.com/a/21706141/7512
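As a rough illustration of the read-a-chunk, process-it loop, the sketch below folds each chunk into a CRC32 checksum and then reuses the buffer. The class name StreamChecksum, the 64 KiB buffer size, and the choice of a checksum as the "processing" step are all assumptions made for the example, not something the question requires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.CRC32;

class StreamChecksum {
    /** Reads the stream chunk by chunk and folds each chunk into a CRC32. */
    static long checksum(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024]; // chunk size is an arbitrary choice
        int n;
        // read() returns the number of bytes placed in the buffer, or -1 at end of stream
        while ((n = in.read(buffer)) != -1) {
            crc.update(buffer, 0, n); // process this chunk; the buffer is reused next round
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java StreamChecksum <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(path)) {
            System.out.printf("CRC32: %08x%n", checksum(in));
        }
    }
}
```

Memory use stays at one buffer regardless of file size, which is why this approach works for a file of several GB.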

The usefulness of this depends on you being able to do something worthwhile with an early part of the file, without knowing what's coming. A lot of the time this is the case. Other times you have to make more than one pass through the file.

File formats are often designed so that processing can be done in a single pass. It's a good idea to design your own file formats with this in mind.

I note that your file is a .trec file, which is a screen-capture video. Video and audio formats are especially likely to be designed for streaming, which is why you can watch the start of a YouTube video before the end has downloaded.

Memory mapping

If you really need to jump around the content of the file to process it, you can open it as a memory-mapped file.

Look at the documentation for RandomAccessFile: this gives you an object with a seek() method, so you can read from arbitrary points in the file's data.
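A minimal sketch of both ideas follows, under some assumptions: the class name RandomRead and its two helper methods are hypothetical, and the demo works on a small temporary file. readAt uses RandomAccessFile.seek() as described above; mapWindow uses FileChannel.map, which is the JDK's actual memory-mapping call. A single mapping is limited to Integer.MAX_VALUE bytes, so a multi-GB file has to be mapped one window at a time.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class RandomRead {
    /** Reads exactly `length` bytes starting at `offset`, using seek(). */
    static byte[] readAt(Path path, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            raf.seek(offset);
            byte[] out = new byte[length];
            raf.readFully(out); // throws EOFException if the file ends first
            return out;
        }
    }

    /** Maps one read-only window of the file; the mapping stays valid after the channel closes. */
    static MappedByteBuffer mapWindow(Path path, long offset, long length) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo data: a small temp file whose byte at position i has the value i.
        Path path = Files.createTempFile("randomread-demo", ".bin");
        path.toFile().deleteOnExit();
        byte[] data = new byte[256];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(path, data);

        System.out.println(Arrays.toString(readAt(path, 10, 3))); // prints [10, 11, 12]
        MappedByteBuffer window = mapWindow(path, 200, 8);
        System.out.println(window.get(0) & 0xFF);                 // prints 200
    }
}
```

With either helper, only the requested region is brought into memory, so the total file size no longer matters.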

Read into multiple arrays

I include this only for completeness; it's ugly to slurp the whole file into heap memory. But if you really wanted to, you could store the bytes in a number of arrays, perhaps a List<byte[]>. Java-ish pseudocode:

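  // Needs java.io.FileInputStream, java.io.InputStream, java.util.ArrayList,
  // java.util.Arrays and java.util.List.
  List<byte[]> filecontents = new ArrayList<>();
  try (InputStream is = new FileInputStream(...)) {
      // MAX_BUFFER_SIZE stands for the per-array cap; any smaller chunk size works too.
      byte[] buffer = new byte[MAX_BUFFER_SIZE];
      int bytesGot;
      // read() returns how many bytes it placed in the buffer, or -1 at end of file.
      while ((bytesGot = is.read(buffer)) != -1) {
          // Keep only the bytes actually read, then reuse the buffer.
          filecontents.add(Arrays.copyOf(buffer, bytesGot));
      }
  }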

This allows you up to MAX_BUFFER_SIZE * Integer.MAX_VALUE bytes. Accessing the contents is slightly more fiddly than using a simple array, but that implementation detail can be hidden inside a class.
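One way such a class might look is sketched below. The name ChunkedBytes, its constructor arguments, and the get(long) accessor are hypothetical; the demo uses a tiny chunk size so the chunk arithmetic is easy to follow, whereas a real use would pick something much larger.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkedBytes {
    private final List<byte[]> chunks = new ArrayList<>();
    private final int chunkSize;
    private long length;

    /** Reads the whole stream into chunks of exactly chunkSize bytes (the last may be shorter). */
    ChunkedBytes(InputStream in, int chunkSize) throws IOException {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
        byte[] buffer = new byte[chunkSize];
        int filled = 0;
        int n;
        while ((n = in.read(buffer, filled, chunkSize - filled)) != -1) {
            filled += n;
            if (filled == chunkSize) { // buffer full: store it and start a fresh one
                chunks.add(buffer);
                length += filled;
                buffer = new byte[chunkSize];
                filled = 0;
            }
        }
        if (filled > 0) { // trailing partial chunk
            chunks.add(Arrays.copyOf(buffer, filled));
            length += filled;
        }
    }

    long length() {
        return length;
    }

    /** Returns the byte at a long index, hiding the chunk arithmetic from callers. */
    byte get(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return chunks.get((int) (index / chunkSize))[(int) (index % chunkSize)];
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        ChunkedBytes bytes = new ChunkedBytes(new ByteArrayInputStream(data), 4);
        System.out.println(bytes.length()); // prints 10
        System.out.println(bytes.get(9));   // prints 9
    }
}
```

Because every chunk except the last is full, a long index maps to a chunk with one division and one remainder.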

You would, of course, need to configure Java to have a huge heap size (the -Xmx option) - see How to set the maximum memory usage for JVM?

Don't do this.
