Best way of loading a large text file in Java

Problem description

I have a text file with a sequence of integers on each line:

47202 1457 51821 59788 
49330 98706 36031 16399 1465
...

The file has 3 million lines in this format. I have to load this file into memory, extract 5-grams from it, and compute some statistics on it. I have a memory limitation (8 GB of RAM). I have tried to minimize the number of objects I create (only one class, with 6 float variables and some methods). Each line of the file basically generates a number of objects of this class, proportional to the size of the line in terms of number of words. I am starting to feel that Java is not a good way to do these things when C++ is around.

Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of space-separated tokens on that line (i.e. 1457). So, considering an average size of 10 words per line, each line maps to 9 objects on average, and there will be 9 * 3 * 10^6 = 27 million objects. The memory needed is therefore 27 * 10^6 * (8-byte object header + 6 × 4-byte floats), plus a Map(String, Object) and a Map(Integer, ArrayList(Object)). I need to keep everything in memory, because some mathematical optimization will happen afterwards.
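As a quick sanity check on those figures, here is a back-of-the-envelope sketch that takes the stated 8-byte header at face value (real JVMs add alignment padding, and the two maps add further overhead):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long objects = 9L * 3_000_000L;     // ~9 objects per line, 3 million lines
        long bytesPerObject = 8 + 6 * 4;    // stated 8-byte header + six 4-byte floats
        long totalBytes = objects * bytesPerObject;
        System.out.println(objects + " objects -> ~" + totalBytes / (1024 * 1024) + " MiB");
        // prints: 27000000 objects -> ~823 MiB (before counting either map)
    }
}
```

So the raw objects alone consume most of a gigabyte before the maps are counted, which is why the answer below focuses on avoiding per-record objects entirely.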

Answer

Reading/parsing the file

The best way to handle large files, in any language, is to try NOT to load them into memory.

In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
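A minimal sketch of this approach is below; the file name numbers.txt is a placeholder, and note that a single mapping is limited to about 2 GB, so truly huge files need to be mapped in several windows:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("numbers.txt", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; its bytes are never copied onto the Java heap.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long current = 0;
            boolean inNumber = false;
            while (buf.hasRemaining()) {
                byte b = buf.get();
                if (b >= '0' && b <= '9') {   // accumulate digits of the current integer
                    current = current * 10 + (b - '0');
                    inNumber = true;
                } else if (inNumber) {        // whitespace or newline ends a number
                    // ... hand `current` to the 5-gram/statistics logic here ...
                    current = 0;
                    inNumber = false;
                }
            }
        }
    }
}
```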

You might also try reading the file line by line and discarding each line after you read it, again to avoid holding the entire file in memory at once.
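A minimal line-by-line sketch, again with numbers.txt as a placeholder; each line becomes garbage as soon as the loop advances:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineByLine {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("numbers.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split the line into its integer tokens and process them;
                // once the loop advances, the line and tokens can be collected.
                String[] tokens = line.trim().split("\\s+");
                // ... extract 5-grams / update statistics here ...
            }
        }
    }
}
```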

Dealing with the objects you produce

For dealing with the objects you produce while parsing, there are several options:


  1. Same as with the file itself: if you can perform whatever it is you want to perform without keeping all of the objects in memory (while "streaming" the file), that is the best solution. You did not describe the problem you are trying to solve, so I do not know whether that is possible.

  2. Compression of some sort: switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and construct only short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly.

  3. Caching/offload: if your data still does not fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
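As promised in option 2, here is a minimal sketch of the flyweight idea for the question's six-float records; RecordStore is a hypothetical name, and the actual statistics logic is left out:

```java
// Flyweight-style store: all six float fields of every record live in one
// giant float[], so there is no per-record object header at all.
public class RecordStore {
    private static final int FIELDS = 6;   // six float fields per record
    private final float[] data;
    private int count = 0;

    public RecordStore(int capacity) {
        data = new float[capacity * FIELDS];
    }

    // Append a record and return its index (the "handle" used instead of a reference).
    public int add(float f0, float f1, float f2, float f3, float f4, float f5) {
        int base = count * FIELDS;
        data[base] = f0;     data[base + 1] = f1; data[base + 2] = f2;
        data[base + 3] = f3; data[base + 4] = f4; data[base + 5] = f5;
        return count++;
    }

    // Read field f of record i without materializing any object.
    public float field(int i, int f) {
        return data[i * FIELDS + f];
    }
}
```

Each record now costs exactly 24 bytes of payload with no per-object header, and reading a field allocates nothing.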

A note on Java collections and maps

Java collections, and maps in particular, incur a large memory penalty for small objects (mostly because everything gets wrapped as an Object, and because of the Map.Entry inner-class instances). At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
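For illustration, a minimal sketch assuming GNU Trove 3 (trove4j) is on the classpath; the key 1457 is just a sample token from the question's data:

```java
import gnu.trove.list.array.TFloatArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

public class TroveDemo {
    public static void main(String[] args) {
        // int keys are stored unboxed: no Integer wrappers, no Map.Entry objects.
        TIntObjectHashMap<TFloatArrayList> byId = new TIntObjectHashMap<>();

        TFloatArrayList values = new TFloatArrayList(); // backed by a float[], not Float objects
        values.add(0.5f);
        values.add(1.25f);
        byId.put(1457, values);

        System.out.println(byId.get(1457).get(0)); // 0.5
    }
}
```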
