Stanford-NLP: GC overhead limit exceeded when using parser on Tomcat


Question

We are integrating Stanford NLP into our system and it works fine, except that it causes a "GC overhead limit exceeded" error. We have a memory dump and will analyze it, but if anyone has an idea about this issue, please let us know. The server is quite powerful: SSD, 32 GB RAM, Xeon E5 series.

The code we have:

String text = Jsoup.parse(groupNotes.getMnotetext()).text();
String lang;
try {
    DetectorFactory.clear();
    DetectorFactory.loadProfile("/home/deploy/profiles/");
    Detector detector = DetectorFactory.create();
    detector.append(text);
    lang = detector.detect();
} catch (Exception ignored) {
    lang = "de";
}

// Note: loadModel() reads the whole serialized grammar into the heap;
// in a servlet it should run once at startup, not on every request.
LexicalizedParser lp;
if (lang.toLowerCase().equals("de")) {
    lp = LexicalizedParser.loadModel(GERMAN_PCG_MODEL);
} else {
    lp = LexicalizedParser.loadModel(ENGLISH_PCG_MODEL);
}

Tree parse = lp.parse(text);
List<String> stringList = new ArrayList<>();
List<TaggedWord> taggedWords = parse.taggedYield();
// Keep nouns, verbs, adjectives and foreign words. Matching the exact
// POS tag means "NN" does not also match NNS/NNP/NNPS (which a
// substring check would, adding the same word several times).
Set<String> keepTags = new HashSet<>(Arrays.asList(
        "NN", "NNS", "NNP", "NNPS",
        "VB", "VBD", "VBG", "VBN", "VBZ", "VBP",
        "JJ", "JJR", "JJS", "FW"));
for (TaggedWord word : taggedWords) {
    if (keepTags.contains(word.tag())) {
        stringList.add(word.word());
    }
}
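Note that `lp.parse(text)` above is handed the entire document at once. The `LexicalizedParser` API can also be fed one sentence at a time, e.g. via `DocumentPreprocessor`; the sketch below is one possible way to do that (the class name `SentenceBySentenceParse` and the method `tagDocument` are illustrative, not part of the original code), assuming stanford-parser 3.7.0 with the English model jar on the classpath:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.Tree;

public class SentenceBySentenceParse {
    // Load the grammar once, not per request: each loadModel call
    // deserializes a large model into the heap.
    private static final LexicalizedParser LP = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

    /** Splits raw text into sentences and parses each one separately. */
    public static List<TaggedWord> tagDocument(String text) {
        List<TaggedWord> tagged = new ArrayList<>();
        // DocumentPreprocessor tokenizes and sentence-splits the raw text
        for (List<HasWord> sentence : new DocumentPreprocessor(new StringReader(text))) {
            Tree parse = LP.apply(sentence); // one parse tree per sentence
            tagged.addAll(parse.taggedYield());
        }
        return tagged;
    }
}
```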

JVM options for Apache Tomcat:

CATALINA_OPTS="$CATALINA_OPTS -server -Xms2048M -Xmx3048M -XX:OnOutOfMemoryError='/home/deploy/scripts/tomcatrestart.sh' -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/date.hprof -XX:-UseGCOverheadLimit -Dspring.security.strategy=MODE_INHERITABLETHREADLOCAL"

Any ideas?

POM.xml:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.7.0</version>
    <classifier>models</classifier>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.7.0</version>
    <classifier>models-german</classifier>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-parser</artifactId>
    <version>3.7.0</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.7.0</version>
    <scope>provided</scope>
</dependency>

Answer

Just to clarify something: lp.parse(...) should be run on a sentence, not a paragraph or document. lp.parse(...) takes in a sentence and returns the parse tree for that sentence. You will definitely have crash issues if you run it on paragraph- or document-length text. If you use the pipeline API, you can also set the maximum sentence length the parser will run on, and if you hand in a longer sentence than that you will just get a flat parse. In real-world NLP you will often run into huge "sentences" which are just lists of things, and it's helpful to skip these, since the parser will crash on them. As I noted in the comments, you can learn more about the pipeline API here: http://stanfordnlp.github.io/CoreNLP/api.html
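The advice above can be sketched with the CoreNLP pipeline API: `tokenize` and `ssplit` break the document into sentences before parsing, and `parse.maxlen` caps the sentence length the parser will attempt (the value 60 below is an arbitrary illustration, and the sample text is made up):

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class PipelineParseExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // tokenize/ssplit split the document into sentences before the parser runs
        props.setProperty("annotators", "tokenize, ssplit, pos, parse");
        // sentences longer than this get a flat parse instead of blowing up memory
        props.setProperty("parse.maxlen", "60");
        // build the pipeline once and reuse it across requests
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("This is a sentence. This is another one.");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println(tree);
        }
    }
}
```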
