String split out of memory

Views: 171 | Published: 2020/4/29 3:26:42 | Tags: java, string, split, out-of-memory, large-data

Question

I have a large collection of tab-separated text data in the form DATE NAME MESSAGE. By large I mean 1.76GB in total, split across 1075 files. I need to extract the NAME field from all of the files. So far I have this:

    File f = new File(directory);
    File files[] = f.listFiles();
    // HashSet<String> all = new HashSet<String>();
    ArrayList<String> userCount = new ArrayList<String>();
    for (File file : files) {
        if (file.getName().endsWith(".txt")) {
            System.out.println(file.getName());
            BufferedReader in;
            try {
                in = new BufferedReader(new FileReader(file));
                String str;
                while ((str = in.readLine()) != null) {
                    // if (all.add(str)) {
                    userCount.add(str.split("\t")[1]);
                    // }

                    // if (all.size() > 500)
                    // all.clear();
                }
                in.close();
            } catch (IOException e) {
                System.err.println("Something went wrong: "
                        + e.getMessage());
            }
        }
    }

My program always fails with an out-of-memory error, even with -Xmx1700, and I cannot raise the limit beyond that. Is there any way to optimize the code so that it can handle the ArrayList<String> of NAMEs?

Recommended answer

Since you seem to be open to alternatives to Java, here is an awk solution that should handle it.

cat *.txt | awk -F'\t' '{sum[$2] += 1} END {for (name in sum) print name "," sum[name]}'

Explanation:

-F'\t' - separate on tabs
sum[$2] += 1 - increment the value for the second element (name)

Associative arrays make this extremely succinct. I ran it on a test file generated with the following script:

import random

def main():
    names = ['Nick', 'Frances', 'Carl']
    for i in range(10000):
        date = '2012-03-24'
        name = random.choice(names)
        message = 'asdf'
        print('%s\t%s\t%s' % (date, name, message))

if __name__ == '__main__':
    main()

I get the result:

Carl,3388
Frances,3277
Nick,3335
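
The awk command stays small in memory because it keeps one counter per distinct name instead of one list entry per input line, so memory grows with the number of unique names rather than with the size of the input. For readers who need to stay in Java, the same idea can be sketched roughly as follows. This is only a sketch under assumptions: the class name NameCounter and the methods tally and countNames are hypothetical names chosen for illustration, Java 8 or later is assumed (for Map.merge), and it assumes a per-name count is what is wanted, as in the awk answer, rather than the full list of names.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; not part of the original question or answer.
final class NameCounter {

    /** Adds the NAME field (second tab-separated column) of one line to the tally. */
    public static void tally(Map<String, Integer> counts, String line) {
        int first = line.indexOf('\t');
        if (first < 0) {
            return; // assumption: lines without any tab are malformed and skipped
        }
        int second = line.indexOf('\t', first + 1);
        String name = (second < 0)
                ? line.substring(first + 1)
                : line.substring(first + 1, second);
        // One counter per distinct name instead of one list entry per line.
        counts.merge(name, 1, Integer::sum);
    }

    /** Reads every *.txt file in the directory one line at a time. */
    public static Map<String, Integer> countNames(Path directory) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(directory, "*.txt")) {
            for (Path file : files) {
                // FileReader uses the platform default charset, as in the question's code.
                try (BufferedReader in = new BufferedReader(new FileReader(file.toFile()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        tally(counts, line);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Directory defaults to the current one when no argument is given.
        Path directory = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Integer> e : countNames(directory).entrySet()) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
    }
}
```

Run with the data directory as the only argument, this prints one name,count line per distinct name, in the same format as the awk output; the TreeMap keeps the names sorted. If only the set of distinct names is needed, a HashSet<String> in place of the map would serve the same purpose.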
