String split out of memory
Problem description
I have a large collection of tab-separated text data in the form DATE NAME MESSAGE. By large I mean 1.76 GB in total, divided into 1075 actual files. I have to get the NAME data from all the files. So far I have this:
File f = new File(directory);
File files[] = f.listFiles();
// HashSet<String> all = new HashSet<String>();
ArrayList<String> userCount = new ArrayList<String>();
for (File file : files) {
    if (file.getName().endsWith(".txt")) {
        System.out.println(file.getName());
        BufferedReader in;
        try {
            in = new BufferedReader(new FileReader(file));
            String str;
            while ((str = in.readLine()) != null) {
                // if (all.add(str)) {
                userCount.add(str.split("\t")[1]);
                // }
                // if (all.size() > 500)
                // all.clear();
            }
            in.close();
        } catch (IOException e) {
            System.err.println("Something went wrong: "
                    + e.getMessage());
        }
    }
}
My program always throws an out-of-memory exception, even with -Xmx1700. I cannot go beyond that. Is there any way I can optimize the code so that it can handle the ArrayList<String> of NAMEs?
Recommended answer
Since you seem to be open to solutions other than Java, here's an awk one-liner that should handle it.
cat *.txt | awk -F'\t' '{sum[$2] += 1} END {for (name in sum) print name "," sum[name]}'
Explanation:
-F'\t' - separate on tabs
sum[$2] += 1 - increment the value for the second element (name)
Associative arrays make this extremely succinct. Running it on a test file I created as follows:
import random


def main():
    names = ['Nick', 'Frances', 'Carl']
    for i in range(10000):
        date = '2012-03-24'
        name = random.choice(names)
        message = 'asdf'
        # Emit one tab-separated DATE NAME MESSAGE record per line
        print('%s\t%s\t%s' % (date, name, message))


if __name__ == '__main__':
    main()
I get the results:
Carl,3388
Frances,3277
Nick,3335
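If you need to stay in Java, the same counting idea can be sketched with a HashMap standing in for awk's associative array. This is only a rough sketch under a few assumptions: that a per-name count is what you actually need (as the awk version assumes) rather than a list of every occurrence, that the files are UTF-8, and that lines without a tab can be skipped. The class name NameCounter and the method countNames are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; not part of the original question or answer.
class NameCounter {

    // Counts how often each NAME (second tab-separated field) occurs
    // across all .txt files in the given directory.
    static Map<String, Integer> countNames(Path directory) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(directory, "*.txt")) {
            for (Path file : files) {
                // Assumption: the files are UTF-8 encoded.
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // Limit of 3 leaves the MESSAGE field unsplit.
                        String[] fields = line.split("\t", 3);
                        if (fields.length < 2) {
                            continue; // assumption: skip lines with no NAME field
                        }
                        // One map entry per distinct name, not one list entry per line.
                        counts.merge(fields[1], 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Directory is taken from the first argument, defaulting to the current one.
        Path directory = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Integer> entry : countNames(directory).entrySet()) {
            System.out.println(entry.getKey() + "," + entry.getValue());
        }
    }
}
```

Memory use here grows with the number of distinct names rather than with the number of lines, which is the same property that lets the awk version get through the data.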