How to read large files (a single continuous string) in Java?

Question

I am trying to read a very large file (~2 GB). The content is one continuous string of sentences, which I would like to split on '.'. No matter what I try, I end up with an OutOfMemoryError.

    BufferedReader in = new BufferedReader(new FileReader("a.txt"));
    String read = null;
    int i = 0;
    while((read = in.readLine())!=null) {
        String[] splitted = read.split("\\.");
        for (String part: splitted) {
            i+=1;
            users.add(new User(i,part));
            repository.saveAll(users);
        }
    }

Also,

    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }

Content of the file (composed of random words with a full stop after 10 words):

fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc fmfbqi .xcdqnjqln kvjhw pexrbunnr cgvrqlr fpaczdegnb puqzjdbp gcfxne jawml aaiwwmo ugzoxn .opjc  (so on)

Please help!

Recommended answer

So first and foremost, based on comments on your question, as Joachim Sauer stated:


If there are no newlines, then there is only a single line and thus only one line number.

So your use case is faulty, at best.

Let's move past that and assume there may be newline characters or, better yet, that the . character you're splitting on is intended as a pseudo-newline.

Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but make sure you wrap it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read the file in chunks. The Scanner still behaves as usual, so as the caller you never see that the buffering is happening:

Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10*1024));

This lets the Scanner function as you expect while the BufferedReader reads the file in fixed-size chunks, which keeps your memory footprint small. Note that the second constructor argument is the buffer size in characters, so 10*1024 buffers about 10K characters at a time; use 10*1024*1024 if you want roughly 10M characters per read. Now you just set the delimiter and keep calling next():

sc.useDelimiter("\\.");
for (int i = 0; sc.hasNext(); i++) {
    String pseudoLine = sc.next();
    // store pseudo-line 'i' in your database
    // DO NOT keep pseudoLine anywhere else - you don't have the memory for it
}
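
Put together, the loop above might look like the following self-contained sketch. The class and method names (SentenceStreamer, forEachSentence) are made up for illustration, the sink callback is a stand-in for whatever database call you use, and a StringReader replaces the real file so the example runs anywhere. Trimming and skipping empty chunks is an assumption about what you want, not something the question specifies.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.BiConsumer;

public class SentenceStreamer {

    // Hypothetical helper: hands each '.'-delimited chunk to the sink, one at a time,
    // and returns how many chunks were delivered. Nothing is kept after the sink returns.
    public static int forEachSentence(Readable source, BiConsumer<Integer, String> sink) {
        int count = 0;
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter("\\.");
            while (sc.hasNext()) {
                String sentence = sc.next().trim();
                if (sentence.isEmpty()) {
                    continue; // assumption: blank chunks are not worth storing
                }
                count++;
                sink.accept(count, sentence);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in input; for the real file use new BufferedReader(new FileReader("a.txt")).
        Readable source = new BufferedReader(new StringReader("aa bb .cc dd .ee"));
        List<String> seen = new ArrayList<>();
        int n = forEachSentence(source, (i, s) -> seen.add(i + ":" + s));
        System.out.println(n + " " + seen); // prints "3 [1:aa bb, 2:cc dd, 3:ee]"
    }
}
```

In real use the sink would write the sentence to the database and return without holding a reference to it, so each chunk can be garbage collected right away.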

Since you don't have enough memory, the point to reiterate is: don't keep any part of the file in your JVM's heap after reading it. Read it, use it as you need, and let it become eligible for garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so read a pseudo-line, store it in the database, and discard it.
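
A variation that is not part of the answer itself: if one database call per sentence turns out to be too slow, a small bounded batch keeps the same "store it, then discard it" discipline. The question's loop adds to users and calls saveAll(users) on every iteration without ever clearing the list, so the list grows until the heap is gone. The sketch below is only an illustration; BatchWriter and its saver callback are hypothetical names, with the saver standing in for something like repository.saveAll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> saver; // stand-in for e.g. repository::saveAll
    private final List<T> buffer;

    public BatchWriter(int batchSize, Consumer<List<T>> saver) {
        this.batchSize = batchSize;
        this.saver = saver;
        this.buffer = new ArrayList<>(batchSize);
    }

    // Buffers one item and flushes as soon as the batch is full,
    // so at most batchSize items are ever held in memory.
    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Hands the pending items to the saver and forgets them.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        saver.accept(new ArrayList<>(buffer)); // pass a copy, then drop our references
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchWriter<String> writer =
                new BatchWriter<>(2, batch -> System.out.println("saving " + batch));
        for (String s : new String[] {"a", "b", "c", "d", "e"}) {
            writer.add(s);
        }
        writer.flush(); // do not forget the final partial batch
    }
}
```

The batch size is a tuning knob: larger batches mean fewer round trips to the database but more sentences held on the heap at once.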

There are other things to point out here, such as configuring your JVM arguments, but I hesitate to even mention it, because simply raising the JVM's memory limit is another brute-force approach. There's nothing wrong with setting a higher max heap size, but learning memory management is better if you're still learning how to write software. You'll get into less trouble later when you move into professional development.
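
If you do want to see what heap limit you are actually running with before touching any settings, a tiny check like the one below can help. HeapInfo is a hypothetical class name; Runtime.maxMemory() is the standard call, and on HotSpot JVMs the limit is usually set with the -Xmx option (for example java -Xmx512m HeapInfo).

```java
public class HeapInfo {

    // Maximum heap the JVM will attempt to use, in MiB.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Comparing that number with the 2 GB file size makes it clear why any approach that holds the whole content in one String cannot work without streaming.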

Also, I mentioned Scanner and BufferedReader because you used them in your question, but I think checking out java.nio.file.Files.lines(), as pointed out by deHaar, is also a good idea. It does basically the same thing as the code I've laid out, with the caveat that it still reads one line at a time, with no way to change what you're 'splitting' on. So if your text file consists of a single line, this will still cause you a problem, and you will still need something like a Scanner to break the line up.
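
To make that caveat concrete, here is a small sketch that writes a sample file without any newline and counts what each approach sees. The class and helper names (LinesVsScanner, countLines, countSentences, writeSample) are invented for the example, and the temp file is a tiny stand-in for the real 2 GB input.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.stream.Stream;

public class LinesVsScanner {

    // Counts lines the way Files.lines sees them (split on line terminators only).
    public static long countLines(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts '.'-delimited chunks with a Scanner, one chunk in memory at a time.
    public static long countSentences(Path file) {
        try (Scanner sc = new Scanner(file, "UTF-8")) {
            sc.useDelimiter("\\.");
            long n = 0;
            while (sc.hasNext()) {
                sc.next();
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the given content to a temp file that is deleted when the JVM exits.
    public static Path writeSample(String content) {
        try {
            Path p = Files.createTempFile("sample", ".txt");
            Files.write(p, content.getBytes(StandardCharsets.UTF_8));
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path p = writeSample("aa bb .cc dd .ee ff");
        // prints "1 line(s), 3 sentence(s)"
        System.out.println(countLines(p) + " line(s), " + countSentences(p) + " sentence(s)");
    }
}
```

Files.lines reads lazily, but each line still has to be materialized as one String, so a file that is a single 2 GB line gains nothing from it.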
