Getting OOM while using GATE on large data set


Question

I am quite new to NLP and am using GATE for it. I get an OutOfMemoryError (OOM) when I run my code on a large data set (containing 7K+ records). Below is the code where the exception occurs.

/**
 * Run ANNIE
 *
 * @param controller
 * @throws GateException
 */
public void execute(SerialAnalyserController controller)
        throws GateException {
    TestLogger.info("Running ANNIE...");
    controller.execute();     /**** GateProcessor.java:217 ***/

    // controller.cleanup();
    TestLogger.info("...ANNIE complete");
}

Here is the log:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.addEntry(Unknown Source)
    at java.util.HashMap.put(Unknown Source)
    at java.util.HashMap.putAll(Unknown Source)
    at gate.annotation.AnnotationSetImpl.<init>(AnnotationSetImpl.java:111)
    at gate.jape.SinglePhaseTransducer.attemptAdvance(SinglePhaseTransducer.java:448)
    at gate.jape.SinglePhaseTransducer.transduce(SinglePhaseTransducer.java:287)
    at gate.jape.MultiPhaseTransducer.transduce(MultiPhaseTransducer.java:168)
    at gate.jape.Batch.transduce(Batch.java:352)
    at gate.creole.Transducer.execute(Transducer.java:116)
    at gate.creole.SerialController.runComponent(SerialController.java:177)
    at gate.creole.SerialController.executeImpl(SerialController.java:136)
    at gate.creole.SerialAnalyserController.executeImpl(SerialAnalyserController.java:67)
    at gate.creole.AbstractController.execute(AbstractController.java:42)
    at in.co.test.GateProcessor.execute(GateProcessor.java:217)

I would like to know what exactly is happening inside the execute function and how this can be resolved. Thanks.

Recommended answer

Processing large (or many) documents in GATE can require a lot of memory, because GATE needs space to store the annotations. In addition, various processing resources are memory-hungry themselves: gazetteers, statistical model-based taggers, etc.

A trick in the GATE Developer GUI is to store the corpus of documents in a datastore, then load only the corpus and run the pipeline. GATE is smart enough to load one document at a time, process it, then save and close it before opening the next one. (You can first store an empty corpus in a datastore and then "populate" it from a folder; this again loads the documents one by one without wasting memory.)

This is exactly what you should do in your code: open a document, process it, save it, and close it before opening the next one. If you have a single large document, you should split it (in a way that doesn't hurt your annotation performance).
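One possible way to split a single large text is to cut only at blank lines, so that no paragraph is broken in two. The following is a minimal plain-Java sketch of that idea. The names `TextChunker` and `splitByParagraph` and the `maxChars` limit are illustrative assumptions, not part of GATE, and whether paragraph boundaries are safe cut points depends on your pipeline.

```java
import java.util.ArrayList;
import java.util.List;

public class TextChunker {

    /**
     * Splits text into chunks of at most maxChars characters, cutting only
     * at blank lines (paragraph boundaries). A single paragraph longer than
     * maxChars is kept whole rather than cut in the middle.
     */
    public static List<String> splitByParagraph(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // A paragraph separator is a line break, optional whitespace, and another line break.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            if (paragraph.trim().isEmpty()) {
                continue;
            }
            // Start a new chunk if adding this paragraph would exceed the limit.
            if (current.length() > 0
                    && current.length() + 2 + paragraph.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.";
        for (String chunk : splitByParagraph(text, 40)) {
            System.out.println("--- chunk (" + chunk.length() + " chars) ---");
            System.out.println(chunk);
        }
    }
}
```

Each chunk can then be turned into its own document and processed separately. Note that annotation offsets will be relative to the chunk, not to the original text.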

This is from the "Advanced GATE Embedded" module:

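// For each piece of text (assumes `application` is a CorpusController,
// such as the SerialAnalyserController from the question):

Document doc = (Document) Factory.createResource("gate.corpora.DocumentImpl",
        Utils.featureMap("stringContent", text, "mimeType", mime));
Corpus corpus = Factory.newCorpus("webapp corpus");
// The pipeline only processes the corpus it has been given.
application.setCorpus(corpus);
try {
  corpus.add(doc);
  application.execute();
  ...
} finally {
  // Release the document so its annotations can be garbage-collected.
  corpus.clear();
  Factory.deleteResource(doc);
}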
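The same discipline can be illustrated without GATE on the classpath. The sketch below is plain Java and only demonstrates the pattern: load one record, process it, keep the small result, and release the record before loading the next. `OneAtATime`, `LoadedDoc`, and `processAll` are hypothetical stand-ins, not GATE classes. In real code the `close()` step would correspond to `corpus.clear()` followed by `Factory.deleteResource(doc)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class OneAtATime {

    /** Hypothetical stand-in for a loaded document; not part of the GATE API. */
    public static final class LoadedDoc implements AutoCloseable {
        public static int live = 0;      // documents currently held in memory
        public static int peakLive = 0;  // highest value `live` has reached
        public final String text;

        public LoadedDoc(String text) {
            this.text = text;
            live++;
            peakLive = Math.max(peakLive, live);
        }

        @Override
        public void close() {
            live--;  // stands in for releasing the document
        }
    }

    /** Processes records one by one, retaining only the per-record result. */
    public static <R> List<R> processAll(Iterable<String> records,
                                         Function<LoadedDoc, R> pipeline) {
        List<R> results = new ArrayList<>();
        for (String record : records) {
            // try-with-resources releases the document even if the pipeline throws.
            try (LoadedDoc doc = new LoadedDoc(record)) {
                results.add(pipeline.apply(doc));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> lengths = processAll(
                List.of("first record", "second", "third one"),
                doc -> doc.text.length());
        System.out.println("results: " + lengths);
        System.out.println("peak live documents: " + LoadedDoc.peakLive);
    }
}
```

The key design point is that only the extracted results outlive each iteration. If the loop instead collected the documents themselves, memory use would grow with the number of records.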
