Best way of loading a large text file in Java

Problem description

I have a text file with a sequence of integers on each line:

47202 1457 51821 59788 
49330 98706 36031 16399 1465
...

The file has 3 million lines in this format. I have to load this file into memory, extract 5-grams from it, and compute some statistics on it. I have a memory limitation (8 GB of RAM). I have tried to minimize the number of objects I create (only one class, with 6 float variables and some methods). Each line of the file basically generates a number of objects of this class, proportional to the size of the line in terms of number of words. I am starting to feel that Java is not a good way to do these things when C++ is around.

Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of space-separated tokens on that line (i.e. 1457). So, considering an average size of 10 words per line, each line maps to 9 objects on average, and there will be 9 * 3 * 10^6 = 27 million objects. The memory needed is therefore 27 * 10^6 * (8-byte object header + 6 × 4-byte floats), plus a Map(String, Object) and a Map(Integer, ArrayList(Object)). I need to keep everything in memory, because some mathematical optimization will happen afterwards.
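As a quick sanity check on those figures, here is a back-of-the-envelope sketch that takes the stated 8-byte header at face value (real JVMs add alignment padding, and the two maps add further overhead):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long objects = 9L * 3_000_000L;     // ~9 objects per line, 3 million lines
        long bytesPerObject = 8 + 6 * 4;    // stated 8-byte header + six 4-byte floats
        long totalBytes = objects * bytesPerObject;
        System.out.println(objects + " objects -> ~" + totalBytes / (1024 * 1024) + " MiB");
        // prints: 27000000 objects -> ~823 MiB (before counting either map)
    }
}
```

So the raw objects alone consume most of a gigabyte before the maps are counted, which is why the answer below focuses on avoiding per-record objects entirely.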

Answer

Reading/parsing the file

The best way to handle large files, in any language, is to try NOT to load them into memory.

In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
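A minimal sketch of this approach is below; the file name numbers.txt is a placeholder, and note that a single mapping is limited to about 2 GB, so truly huge files need to be mapped in several windows:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("numbers.txt", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; its bytes are never copied onto the Java heap.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long current = 0;
            boolean inNumber = false;
            while (buf.hasRemaining()) {
                byte b = buf.get();
                if (b >= '0' && b <= '9') {   // accumulate digits of the current integer
                    current = current * 10 + (b - '0');
                    inNumber = true;
                } else if (inNumber) {        // whitespace or newline ends a number
                    // ... hand `current` to the 5-gram/statistics logic here ...
                    current = 0;
                    inNumber = false;
                }
            }
        }
    }
}
```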

You might also try reading the file line by line and discarding each line after you read it, again to avoid holding the entire file in memory at once.
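A minimal line-by-line sketch, again with numbers.txt as a placeholder; each line becomes garbage as soon as the loop advances:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineByLine {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("numbers.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split the line into its integer tokens and process them;
                // once the loop advances, the line and tokens can be collected.
                String[] tokens = line.trim().split("\\s+");
                // ... extract 5-grams / update statistics here ...
            }
        }
    }
}
```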

Dealing with the objects you produce

For dealing with the objects you produce while parsing, there are several options:


  1. Same as with the file itself: if you can perform whatever it is you want to perform without keeping all of the objects in memory (while "streaming" the file), that is the best solution. You did not describe the problem you are trying to solve, so I do not know whether that is possible.

  2. Compression of some sort: switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and construct only short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly.

  3. Caching/offload: if your data still does not fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
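As promised in option 2, here is a minimal sketch of the flyweight idea for the question's six-float records; RecordStore is a hypothetical name, and the actual statistics logic is left out:

```java
// Flyweight-style store: all six float fields of every record live in one
// giant float[], so there is no per-record object header at all.
public class RecordStore {
    private static final int FIELDS = 6;   // six float fields per record
    private final float[] data;
    private int count = 0;

    public RecordStore(int capacity) {
        data = new float[capacity * FIELDS];
    }

    // Append a record and return its index (the "handle" used instead of a reference).
    public int add(float f0, float f1, float f2, float f3, float f4, float f5) {
        int base = count * FIELDS;
        data[base] = f0;     data[base + 1] = f1; data[base + 2] = f2;
        data[base + 3] = f3; data[base + 4] = f4; data[base + 5] = f5;
        return count++;
    }

    // Read field f of record i without materializing any object.
    public float field(int i, int f) {
        return data[i * FIELDS + f];
    }
}
```

Each record now costs exactly 24 bytes of payload with no per-object header, and reading a field allocates nothing.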

A note on Java collections and maps

Java collections, and maps in particular, incur a large memory penalty for small objects (mostly because everything gets wrapped as an Object, and because of the Map.Entry inner-class instances). At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
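For illustration, a minimal sketch assuming GNU Trove 3 (trove4j) is on the classpath; the key 1457 is just a sample token from the question's data:

```java
import gnu.trove.list.array.TFloatArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

public class TroveDemo {
    public static void main(String[] args) {
        // int keys are stored unboxed: no Integer wrappers, no Map.Entry objects.
        TIntObjectHashMap<TFloatArrayList> byId = new TIntObjectHashMap<>();

        TFloatArrayList values = new TFloatArrayList(); // backed by a float[], not Float objects
        values.add(0.5f);
        values.add(1.25f);
        byId.put(1457, values);

        System.out.println(byId.get(1457).get(0)); // 0.5
    }
}
```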
