Memory issue when Reading HUGE csv file, STORE as Person objects, Write into multiple cleaner/smaller CSV files


Problem description



I have two text files with comma delimited values. One is 150MB and the other is 370MB, so these guys have three million+ rows of data.

One document holds information about, let's say soft drink preferences, and the next might have information about, let's say hair colors.

Example soft drinks data file, though in the real file the UniqueNames are NOT in order, nor are the dates:

"UniqueName","softDrinkBrand","year"
"001","diet pepsi","2004"
"001","diet coke","2006"
"001","diet pepsi","2004"
"002","diet pepsi","2005"
"003","coca cola","2004"

Essentially, there are too many lines of data to use Excel, so I want to create Person objects using a Person class to hold the data about each person.

Each Person object holds twenty array lists, two for each of ten years 2004-2013, e.g.,

...
private ArrayList<String> sodas2013 = new ArrayList<String>();
private ArrayList<String> hairColors2013 = new ArrayList<String>();
private ArrayList<String> sodas2014 = new ArrayList<String>();
private ArrayList<String> hairColors2014 = new ArrayList<String>();
...

I wrote a program to read the rows of a data file, one at a time, using a BufferedReader. For each row, I clean up the data (split on the commas, delete quote marks...), and then, if that particular uniqueID isn't in a Hashtable yet, I add it, as well as create a new Person object from my Person class, and then I store the data I want into the Person class' ArrayList as above. If the unique ID is already present, I just call a Person method to see if the soda, or hair color, is already in the array list for that particular year (as written in the csv file).
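A minimal sketch of that read-and-clean loop, with the Person bookkeeping collapsed into a nested map for brevity (all names here are illustrative stand-ins, not the asker's actual Person class):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SodaReader {
    // Stand-in for the Hashtable of Person objects: id -> year -> sodas tried
    static Map<String, Map<String, Set<String>>> people = new HashMap<>();

    static void readLines(BufferedReader reader) throws Exception {
        String line = reader.readLine(); // skip the header row
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",");
            String id   = parts[0].replace("\"", "");
            String soda = parts[1].replace("\"", "");
            String year = parts[2].replace("\"", "");
            // computeIfAbsent creates the per-person and per-year entries
            // the first time an id or year is seen
            people.computeIfAbsent(id, k -> new HashMap<>())
                  .computeIfAbsent(year, k -> new HashSet<>())
                  .add(soda); // the Set silently ignores the duplicate row
        }
    }

    public static void main(String[] args) throws Exception {
        String csv = "\"UniqueName\",\"softDrinkBrand\",\"year\"\n"
                   + "\"001\",\"diet pepsi\",\"2004\"\n"
                   + "\"001\",\"diet coke\",\"2006\"\n"
                   + "\"001\",\"diet pepsi\",\"2004\"\n";
        readLines(new BufferedReader(new StringReader(csv)));
        System.out.println(people.get("001").get("2004")); // [diet pepsi]
    }
}
```

Note this sketch uses `String.replace`, which does a literal replacement, rather than the regex-based `replaceAll` from the stack trace.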

The goal is to output twenty different csv files in the end, one tying people to sodas drunk in each year, one to hair colors for that year. They would look like this...

2004 file using above example input file:

UID    pepsi    coca cola    diet pepsi    diet coke    etc
001    false    false    true    false    etc
002    false    false    false    false    etc
003    false    true    false    false    etc

Now, when I have test files of only like 100 lines each, this works beautifully. I save all the data in my Person objects, and then I use methods to match Hashtable uniqueNames to uniqueSoftDrinkNames by year stored in the Person objects to write files with rows of personID, then true/false for every possible soda that any uniqueID had tried in any year. The data looks like the above info.
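That writing pass could be sketched like this (method and data-structure names are hypothetical; the real code pulls the per-year sets out of Person objects rather than a flat map):

```java
import java.util.*;

public class YearWriter {
    // Builds one year's output: a header row, then UID plus true/false
    // for every soda any person tried in any year.
    static List<String> buildYearFile(
            Map<String, Set<String>> sodasByPerson, List<String> allSodas) {
        List<String> lines = new ArrayList<>();
        lines.add("UID," + String.join(",", allSodas));
        // TreeMap sorts the UIDs so output order is stable
        for (String uid : new TreeMap<>(sodasByPerson).keySet()) {
            StringBuilder row = new StringBuilder(uid);
            for (String soda : allSodas) {
                row.append(',').append(sodasByPerson.get(uid).contains(soda));
            }
            lines.add(row.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> in2004 = new HashMap<>();
        in2004.put("001", Set.of("diet pepsi"));
        in2004.put("003", Set.of("coca cola"));
        List<String> sodas = List.of("pepsi", "coca cola", "diet pepsi", "diet coke");
        buildYearFile(in2004, sodas).forEach(System.out::println);
    }
}
```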

So, I know the code works and does what I want it to. The problem now is...

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.StringBuffer.toString(Unknown Source)
at java.util.regex.Matcher.appendReplacement(Unknown Source)
at java.util.regex.Matcher.replaceAll(Unknown Source)
at java.lang.String.replaceAll(Unknown Source)
at CleanDataFiles.main(CleanDataFiles.java:43)

Where line 43 is:

temp = temp.replaceAll("\"", "");

...which is just a simple point of getting rid of quote marks in a given substring after having split a line by the commas.

The program runs for about ten minutes before reaching this error, and both times I ran it, it failed with the same error at the same line.

I'm reading the CSV document line by line, so I'm not storing huge amounts of data in a giant string or anything as I read the file. The only places I'm storing lots of data are the Hashtables in my main class, where I store personIDs and personObjects, two more Hashtables where I store all possible hair colors and all possible sodas, and all of those person objects, each with twenty ArrayLists of soda and hair color info by year.

My supposition is that the memory issue is in storing these tens of thousands of unique person objects with all the data associated with them. That said, I got the error in the same place in a part of my program where I'm merely reading the csv file and cleaning up individual entries...

In any case, MY QUESTION (you were all waiting for this!)

Are there better ways to do this? Instead of tens of thousands or low hundreds of thousands of Person objects holding all this data... should I be creating tens of thousands of Person text files and opening and closing them each time I read a new line of the CSV file and query whether the information is duplicate or new, and if new, add it to the Person file? And then when all is said and done, open each person file to read the information, interpret, and then write it into my growing output file one line at a time, closing that person file, then opening the next one for the next line, etc.?

Or, HOPEFULLY, do you think there is a sillier and easier-to-solve issue elsewhere in this whole mess, so that I don't run out of memory while cleaning up and organizing my data files for further analysis?

I appreciate any help or suggestions! Thank you.

Solution

Here are a couple of thoughts. First, it may be that you have plenty of memory free on your machine but are just not allocating enough for the JVM. Try something like this:

java -Xms2048M -Xmx4096M YourProgram

Of course, the values will depend on how much memory your machine has.

Also, why are you using an ArrayList of Strings in each Person object? If you can determine the possible sodas or whatever ahead of time, then you could use an array of ints instead, which should save some memory.
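A sketch of that idea (the index map and year range here are assumptions): assign each soda a fixed index the first time it appears, and then one BitSet per year replaces an ArrayList<String>:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class CompactPerson {
    // Global soda dictionary, shared by all persons: name -> index
    static Map<String, Integer> sodaIndex = new HashMap<>();

    static int indexOf(String soda) {
        // Assigns the next free index on first sight of a soda name
        return sodaIndex.computeIfAbsent(soda, k -> sodaIndex.size());
    }

    static final int FIRST_YEAR = 2004;

    // One BitSet per year instead of an ArrayList<String>: a set bit at
    // position i means "this person tried soda i that year".
    BitSet[] sodasByYear = new BitSet[10]; // 2004..2013

    void addSoda(int year, String soda) {
        int y = year - FIRST_YEAR;
        if (sodasByYear[y] == null) sodasByYear[y] = new BitSet();
        sodasByYear[y].set(indexOf(soda));
    }

    boolean tried(int year, String soda) {
        BitSet b = sodasByYear[year - FIRST_YEAR];
        Integer i = sodaIndex.get(soda);
        return b != null && i != null && b.get(i);
    }

    public static void main(String[] args) {
        CompactPerson p = new CompactPerson();
        p.addSoda(2004, "diet pepsi");
        System.out.println(p.tried(2004, "diet pepsi")); // true
        System.out.println(p.tried(2004, "diet coke"));  // false
    }
}
```

Each soda costs one bit per person per year this way, instead of a separate String entry in every person's ArrayList, and the `tried` lookup maps directly onto the true/false columns of the output files.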

Another option would be to do it piecewise: first do sodas, and when you are done, do hair colors, et cetera.
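The piecewise idea can be sketched as below, with the actual file I/O stubbed out (all names and data shapes here are hypothetical); the point is that clearing the shared structure between attributes means only one attribute's data is ever in memory:

```java
import java.util.*;

public class PiecewiseDriver {
    // One shared map, reused for each attribute file in turn: uid -> values
    static Map<String, Set<String>> data = new HashMap<>();

    // Stand-in for reading one attribute's CSV: rows of {uid, value}
    static void processRows(List<String[]> rows) {
        for (String[] r : rows)
            data.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
    }

    // Stand-in for writing the per-year output files; returns the row count
    static int writeOutputs(String label) {
        return data.size();
    }

    public static void main(String[] args) {
        List<String[]> sodaRows = new ArrayList<>();
        sodaRows.add(new String[]{"001", "diet pepsi"});
        processRows(sodaRows);
        int sodaCount = writeOutputs("sodas");

        data.clear(); // free the soda data before loading hair colors

        List<String[]> hairRows = new ArrayList<>();
        hairRows.add(new String[]{"001", "brown"});
        hairRows.add(new String[]{"002", "red"});
        processRows(hairRows);
        int hairCount = writeOutputs("hairColors");

        System.out.println(sodaCount + " " + hairCount); // 1 2
    }
}
```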

