Exception while reading a very large file > 300 MB


Problem Description


My task is to open a large file in READ & WRITE mode, and I need to search for a portion of text in that file by finding its starting and ending points. Then I need to write that searched region of text to a new file and delete that portion from the original file.


I will repeat the above process several times, so I thought it would be easy to load the file into memory in a CharBuffer and search it with the Matcher class. But I get a HeapSpace exception while reading, even though I increased the heap to 900 MB by running:

java -Xms128m -Xmx900m readLargeFile

My code is:

FileChannel fc = new FileInputStream(fFile).getChannel();
// Maps the entire file, then decodes it into one big CharBuffer:
CharBuffer chrBuff = Charset.forName("8859_1").newDecoder()
        .decode(fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()));





For the code above, everyone suggested that it is a bad idea to load everything into memory: if the file size is 300 MB, it becomes 600 MB because of the charset (each char takes 2 bytes).


So that is my task; please suggest some efficient approaches. Note that my files may be even larger, and I have to do all of this using Java only.

Thanks in advance...

Recommended Answer


You definitely do NOT want to load a 300MB file into a single large buffer with Java. The way you're doing things is supposed to be more efficient for large files than just using normal I/O, but when you run a Matcher against an entire file mapped into memory as you are, you can very easily exhaust memory.


First, your code memory maps the file into memory ... this will consume 300 Meg of memory in your virtual address space as the file is mmaped into it, although this is outside the heap. (Note that the 300 Meg of virtual address space is tied up until the MappedByteBuffer is garbage collected. See below for discussion. The JavaDoc for map warns you about this.) Next, you create a ByteBuffer backed by this mmaped file. This should be fine, as it's just a "view" of the mmaped file and should thus take minimal extra memory. It will be a small object in the heap with a "pointer" to a large object outside the heap. Next, you decode this into a CharBuffer, which means you make a copy of the 300 MB buffer, but you make a 600 MB copy (on the heap) because a char is 2 bytes.
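To make that accounting concrete, here is the question's snippet again with the approximate footprint of each step annotated (a sketch; fFile and the imports are as in the question):

FileChannel fc = new FileInputStream(fFile).getChannel();

// ~300 MB of virtual address space outside the heap, held until the
// MappedByteBuffer is garbage collected:
MappedByteBuffer mapped = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

// ~600 MB on the heap: decode() allocates a fresh buffer of one char
// per byte of the file, and each char is 2 bytes:
CharBuffer chrBuff = Charset.forName("8859_1").newDecoder().decode(mapped);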


To respond to a comment, and looking at the JDK Source code to be sure, when you call map() as the OP is, you do in fact map the entire file into memory. Looking at openJDK 6 b14 Windows native code sun.nio.ch.FileChannelImpl.c, it first calls CreateFileMapping, then calls MapViewOfFile. Looking at this source, if you ask to map the whole file into memory, this method will do exactly as you ask. To quote MSDN:



Mapping a file makes the specified portion of a file visible in the address space of the calling process.


For files that are larger than the address space, you can only map a small portion of the file data at one time. When the first view is complete, you can unmap it and map a new view.
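Following that advice, here is a minimal sketch (my illustration, not the OP's code) that maps and decodes the file one modest window at a time. This is safe here only because 8859_1 is a single-byte charset, so a window boundary can never split a character:

import java.io.RandomAccessFile;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class WindowedDecode {
    private static final int WINDOW = 16 * 1024 * 1024; // 16 MB per view

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        FileChannel fc = raf.getChannel();
        CharsetDecoder decoder = Charset.forName("8859_1").newDecoder();
        long size = fc.size();
        for (long pos = 0; pos < size; pos += WINDOW) {
            long len = Math.min(WINDOW, size - pos);
            // Each view still stays mapped until it is garbage collected
            // (see the NOTE below), but only ~2 * WINDOW chars are ever
            // live on the heap at once, not a 600 MB CharBuffer.
            MappedByteBuffer window = fc.map(FileChannel.MapMode.READ_ONLY, pos, len);
            CharBuffer chars = decoder.decode(window);
            // ... search this window here ...
        }
        fc.close();
        raf.close();
    }
}

A real version would also have to handle matches that straddle a window boundary, for example by overlapping consecutive windows by the maximum match length.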


The way the OP is calling map, the "specified portion" of the file is the entire file. This won't contribute to heap exhaustion, but it can contribute to virtual address space exhaustion, which is still an OOM error. This can kill your application just as thoroughly as running out of heap.


Finally, when you make a Matcher, the Matcher potentially makes more copies of this 600 MB CharBuffer, depending on how you use it. Ouch. That's a lot of memory used by a small number of objects! Given a Matcher, every time you call toMatchResult(), you'll make a String copy of the entire CharBuffer. Also, every time you call replaceAll(), at best you will make a String copy of the entire CharBuffer. At worst you will make a StringBuffer that will slowly be expanded to the full size of the replaceAll result (applying a lot of memory pressure on the heap), and then make a String from that.


Thus, if you call replaceAll on a Matcher against a 300 MB file, and your match is found, then you'll first make a series of ever-larger StringBuffers until you get one that is 600 MB. Then you'll make a String copy of this StringBuffer. This can quickly and easily lead to heap exhaustion.
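One way to sidestep those copies (a sketch of the general idea, with placeholder file names, not a drop-in fix) is to skip replaceAll() entirely: use find() with start()/end() and write slices of the buffer out yourself, so no String copy of the whole input is ever built:

import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SliceWriter {
    // chars: the decoded text; pattern: matches the region to extract.
    static void extract(CharBuffer chars, Pattern pattern) throws IOException {
        Matcher m = pattern.matcher(chars);
        Writer out = new FileWriter("extracted.txt");   // matched regions
        Writer rest = new FileWriter("remainder.txt");  // everything else
        int tail = 0;
        while (m.find()) {
            rest.append(chars, tail, m.start());    // text before the match
            out.append(chars, m.start(), m.end());  // the matched region itself
            tail = m.end();
        }
        rest.append(chars, tail, chars.length());   // text after the last match
        out.close();
        rest.close();
    }
}

This still needs the decoded text in memory, but it avoids the extra String and StringBuffer copies that replaceAll() makes.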


Here's the bottom line: Matchers are not optimized for working on very large buffers. You can very easily, and without planning to, make a number of very large objects. I discovered this when doing something similar enough to what you're doing and encountering memory exhaustion, then looking at the source code for Matcher.


NOTE: There is no unmap call. Once you call map, the virtual address space outside the heap tied up by the MappedByteBuffer stays tied up until the MappedByteBuffer is garbage collected. As a result, you will be unable to perform certain operations on the file (delete, rename, ...) until the MappedByteBuffer is garbage collected. If you call map enough times on different files, but don't have sufficient memory pressure in the heap to force a garbage collection, you can run out of memory outside the heap. For a discussion, see Bug 4724038.
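There is a commonly cited but unsupported workaround from this era that forces the unmap through JDK-internal classes; a hedged sketch, assuming a Sun/Oracle JDK where sun.nio.ch.DirectBuffer and sun.misc.Cleaner exist:

import java.nio.MappedByteBuffer;

public class Unmapper {
    // Releases the mapping immediately instead of waiting for GC.
    // WARNING: sun.misc.Cleaner and sun.nio.ch.DirectBuffer are internal
    // JDK classes; this may fail to compile or run on other JVMs/versions.
    static void unmap(MappedByteBuffer buffer) {
        if (buffer instanceof sun.nio.ch.DirectBuffer) {
            sun.misc.Cleaner cleaner = ((sun.nio.ch.DirectBuffer) buffer).cleaner();
            if (cleaner != null) {
                cleaner.clean();
            }
        }
    }
}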


As a result of all of the discussion above, if you will be using it to make a Matcher on large files, and you will be using replaceAll on the Matcher, then memory mapped I/O is probably not the way to go. It will simply create too many large objects on the heap as well as using up a lot of your virtual address space outside the heap. Under 32 bit Windows, you have only 2GB (or if you have changed settings, 3GB) of virtual address space for the JVM, and this will apply significant memory pressure both inside and outside the heap.
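As an alternative that matches the original task, here is a stream-based sketch (my suggestion, with placeholder file and marker names; it assumes the start and end markers each appear on their own line): read the file once, divert everything between the markers to the extract file, write the rest to a temporary file, then swap the temporary file in for the original. Only one line is ever in memory:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractRegion {
    public static void main(String[] args) throws IOException {
        File original = new File("big.txt");        // placeholder names
        File extract = new File("extract.txt");
        File remainder = new File("big.txt.tmp");

        BufferedReader in = new BufferedReader(new FileReader(original));
        BufferedWriter outExtract = new BufferedWriter(new FileWriter(extract));
        BufferedWriter outRest = new BufferedWriter(new FileWriter(remainder));

        boolean inRegion = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (!inRegion && line.contains("START-MARKER")) inRegion = true;
            BufferedWriter target = inRegion ? outExtract : outRest;
            target.write(line);
            target.newLine();
            if (inRegion && line.contains("END-MARKER")) inRegion = false;
        }
        in.close();
        outExtract.close();
        outRest.close();

        // Swap the remainder in for the original file.
        if (original.delete()) {
            remainder.renameTo(original);
        }
    }
}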


I apologize for the length of this answer, but I wanted to be thorough. If you think any part of the above is wrong, please comment and say so. I will not do retaliatory downvotes. I am very positive that all of the above is accurate, but if something is wrong, I want to know.

