Java OutOfMemoryError: GC overhead limit exceeded when processing a large text file - can't figure out how to improve performance


Question


    Note: I browsed all topics on this problem and I understand that it's often down to JVM settings and efficient coding, but I don't know how to improve things further.

    I am processing a large text file (1 GB) of CAIDA network topologies; this is basically a dump of the entire Internet IPv4 topology. Each line has the format "node continent region country city latitude longitude", and I need to filter out all duplicate nodes (i.e., every node with the same latitude/longitude).

    I assign a unique name to all nodes with the same geo location and maintain a hashmap of each geo location->unique name already encountered. I also maintain a hashmap of each oldname->unique name because in a next step I must process another file where these old names have to be mapped to the new unique name per location.

    I wrote this in Java because this is where all my other processing happens, but I'm getting the "GC overhead limit exceeded" error. Below is the code being executed and the error log:

            Scanner sc = new Scanner(new File(geo));
            String line = null;
    
            HashMap<String, String> nodeGeoMapper = new HashMap<String, String>(); // maps each coordinate to a unique node name
            HashMap<String, String> nodeMapper = new HashMap<String, String>(); // maps each original node name to a filtered node name (1 name per geo coordinate)
    
            PrintWriter output = new PrintWriter(geoFiltered);
            output.println("#node.geo Name\tcontintent\tCountry\tregion\tcity\tlatitude\tlongitude");
            int frenchCounter = 0;
            int nodeCounter = 0; // generates the next unique node name ("N0", "N1", ...)
    
            // declare all variables used in loop to avoid creating thousands of tiny objects
            String[] fields = null;
            String name = null;
            String continent = null;
            String country = null;
            String region = null;
            String city = null;
            double latitude = 0.0;
            double longitude = 0.0;
            String key = null;
            boolean seenBefore = true;
            String newname = null;
            String nodename = null;
    
            while (sc.hasNextLine()) {
                line = sc.nextLine();
                if (line.startsWith("node.geo")) {
    
                    // process a line and retrieve the fields
                    fields = line.split("\t"); // split the line into fields on the tab separator
                    name = fields[0];
                    name = name.trim().split(" ")[1]; // the first field is "node.geo N<id>"; keep the name after the space
                    continent = ""; // is empty and gets skipped
                    country = fields[2];
                    region = fields[3];
                    city = fields[4];
                    latitude = Double.parseDouble(fields[5]);
                    longitude = Double.parseDouble(fields[6]);
    
                    // we only want one node for each coordinate pair so we map to a unique name
                    key = makeGeoKey(latitude, longitude);
    
                    // check if we have seen a node with these coordinates before
                    seenBefore = true;
                    if (!nodeGeoMapper.containsKey(key)) {
                        newname = "N"+nodeCounter;
                        nodeCounter++;
                        nodeGeoMapper.put(key, newname);
                        seenBefore = false;
                        if (country.equals("FR"))
                            frenchCounter++;
                    }
                    nodename = nodeGeoMapper.get(key); // retrieve the unique name assigned to these geo coordinates
                    nodeMapper.put(name, nodename); // keep a reference from old name to new name so we can map later
    
    
                    if (!seenBefore) {
                    //  System.out.println("node.geo "+nodename+"\t"+continent+"\t"+country+"\t"+region+"\t"+city+"\t"+latitude+"\t"+longitude);
                        output.println("node.geo "+nodename+"\t"+continent+"\t"+country+"\t"+region+"\t"+city+"\t"+latitude+"\t"+longitude);
                    }
    
                }
            }
            sc.close();
            output.close();
            nodeGeoMapper = null;
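
    The loop above calls a helper makeGeoKey that the post never shows. A minimal sketch of what it presumably does (an assumption, not the author's code) is to join the coordinate pair into a single string key so that equal coordinates collide in the map:

        // Hypothetical reconstruction of the helper referenced above:
        // builds one string key out of a (latitude, longitude) pair.
        private static String makeGeoKey(double latitude, double longitude) {
            return latitude + ":" + longitude;
        }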
    

    Error:

    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Matcher.<init>(Unknown Source)
    at java.util.regex.Matcher.toMatchResult(Unknown Source)
    at java.util.Scanner.match(Unknown Source)
    at java.util.Scanner.hasNextLine(Unknown Source)
    at DataProcessing.filterGeoNodes(DataProcessing.java:236)
    at DataProcessing.main(DataProcessing.java:114)
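
    Note that the trace only shows where the failing allocation happened (inside Scanner's regex machinery); the memory is actually being consumed by the growing maps. Still, BufferedReader.readLine() is a lighter way to iterate over lines than Scanner, since it performs no per-line regex matching. A hedged sketch of that swap, using Java 7 try-with-resources and the same per-line processing as above:

        // Requires java.io.BufferedReader and java.io.FileReader.
        try (BufferedReader br = new BufferedReader(new FileReader(geo))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("node.geo")) {
                    // ... same field parsing and map updates as in the loop above ...
                }
            }
        }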
    

    During execution my Java process ran constantly at 80% CPU with roughly 1,000,000K of memory (the laptop has 4 GB total). The output file ended up with 59,987 unique nodes, so that is the number of keys in the GeoLocation->Name hashmap. I don't know the size of the oldName->NewName hashmap, but it should be less than Integer.MAX_VALUE because there are not that many lines in my text file.

    My two questions are:

    • How can I improve my code to use less memory or avoid so much GC? (Edit: please keep it Java 7 compatible. One sketch addressing this follows right after this list.)

    • (solved) I've read threads on JVM settings like -Xmx1024m, but I don't know where in the Eclipse IDE I can change these settings. Can someone show me where to set them and which values I might want to try?
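
    One memory-focused sketch for the first question (an illustration, not the accepted fix): the unique String keys and the tens of millions of map entries dominate the footprint, so one option is to pack each coordinate pair into a primitive long instead of a String, and to stream the oldName->newName pairs to a file instead of holding them all in nodeMapper, re-joining them in the later processing step. The quantization to 1e-6 degrees below is an assumption about how much precision the data needs:

        // Packs a quantized (latitude, longitude) pair into one long, so the
        // geo map can be a HashMap<Long, String> instead of String-keyed.
        private static long packGeoKey(double latitude, double longitude) {
            long lat = Math.round((latitude + 90.0) * 1_000_000L);   // 0 .. 180,000,000
            long lon = Math.round((longitude + 180.0) * 1_000_000L); // 0 .. 360,000,000
            return lat * 400_000_000L + lon; // lon < 400,000,000, so keys never collide
        }

    A boxed Long key is still far smaller than a coordinate String; a primitive-keyed map such as GNU Trove's TLongObjectHashMap would shrink it further.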

    Thank you

    SOLVED: for people with a similar problem, the issue was the nodeMapper hashmap, which had to store 34 million String objects and required over 4 GB of memory. I was able to run my program by first disabling the GC overhead limit check with -XX:-UseGCOverheadLimit and then allocating 4 GB of RAM to the Java process with -Xmx4g. It took a long time, but it did work; it was slow because once Java reaches 3-4 GB of RAM it spends most of its time collecting garbage rather than processing the file. A machine with more memory would not have had any problems. Thanks for all the help!
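
    A rough back-of-envelope is consistent with that figure: each of the ~34 million nodeMapper entries costs a HashMap.Entry (roughly 32-48 bytes) plus a unique key String (roughly 50-70 bytes for a short name, counting the object header and backing char[]), so the keys alone plausibly run 34M x ~100 B, about 3.4 GB, on a 64-bit JVM. These per-object sizes are estimates, not measurements.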

    Solution

    Set the JVM arguments in the Eclipse run configuration: Run > Run Configurations... > select your launch configuration > Arguments tab > "VM arguments" box.

    You can also try adding this option when running: -XX:-UseGCOverheadLimit
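
    For reference, the equivalent invocation on a plain command line (the class name is taken from the stack trace above) would look like:

        java -Xmx4g -XX:-UseGCOverheadLimit DataProcessing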

    An interesting explanation of this flag and of your error message can be found here.
