UIMA Ruta Out of Memory issue in Spark context

Question

I'm running a UIMA application on Apache Spark. Millions of pages arrive in batches and are processed by UIMA Ruta. Sometimes I get an out-of-memory exception. It is not consistent: sometimes the exception is thrown after 2000 pages have been processed successfully, and sometimes the job already fails at 500 pages.

Application log

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA RUTA SCRIPT

WORDLIST EnglishStopWordList = 'stopWords.txt';
WORDLIST FiltersList = 'AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};


Recommended answer

Normally, the reasons for high memory usage in UIMA Ruta can be found in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, the problem seems to originate somewhere else. The stack trace indicates that the memory is used up by some disjunctive rule element, which requires new annotations to be created for storing the match information.

It seems that your version of UIMA Ruta is rather old, since the line numbers do not match at all with the source I am looking at.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I was looking for a rule that could cause something like this but found none. This could be an old flaw that has been fixed in newer versions, or some preprocessing added additional CW/SW/CAP annotations.

As a first step, I would suggest two things:


  1. Update to UIMA Ruta 2.6.0

  2. Get rid of all disjunctive rule elements

The disjunctive rule elements are not really needed in your script. In general, they should not be used at all unless they are really required. I do not use them at all in production rules.

Instead of (SW | CW | CAP) you can simply write W

Instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))}

Using ANY as a matching condition can reduce the runtime performance. In this example, writing two rules instead of rewriting the rule element might be better, e.g., something like

SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

(An optional rule element at the start of a rule is not treated as optional when the rule contains no anchor.)
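One detail about the character class shared by all of these rules may be worth checking when rewriting them. The sketch below is an illustration, not part of the original answer. It assumes that the Ruta string literal "['\"-=()\\[\\]]" reaches the regex engine as ['"-=()\[\]] (Java-style unescaping), and it uses Python's re module as a stand-in for the Java regex engine, which handles ranges in a character class the same way for this pattern. Under those assumptions, "-= is read as a range from " to =, so the class also accepts digits and characters such as / and #. This matters more once SPECIAL is replaced by ANY. LITERAL_CLASS is a hypothetical stricter variant and does not appear in the original script.

```python
import re

# The pattern as the regex engine sees it after the Ruta string literal
# "['\"-=()\\[\\]]" has been unescaped (assumption: Java-style escapes).
CHAR_CLASS = re.compile(r"""['"-=()\[\]]""")

# Hypothetical stricter variant: the hyphen is moved to the end of the
# class so that it is a literal character instead of a range operator.
LITERAL_CLASS = re.compile(r"""['"=()\[\]-]""")


def accepted(pattern, candidates):
    """Return the candidates that the pattern matches as a whole string."""
    return [c for c in candidates if pattern.fullmatch(c)]


candidates = ["'", '"', "-", "=", "(", ")", "[", "]", "5", "/", "#", "<", "a", "!"]

# The original class also accepts "5", "/", "#" and "<" because of the range.
print(accepted(CHAR_CLASS, candidates))

# The stricter class accepts only the eight characters that are listed.
print(accepted(LITERAL_CLASS, candidates))
```

Whether the wider class is intended depends on the data. If only the listed characters should match, moving the hyphen to the end of the class (or escaping it) in the Ruta script would be the corresponding change.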

By the way, there is a lot of room for optimization in your rules. If I had to guess, I'd say you can get rid of at least half of the rules and 90% of all created annotations, which would also considerably reduce the memory usage.

DISCLAIMER: I am a developer of UIMA Ruta
