Reusable version of DKPro Core pipeline

Question

I have set up DKPro Core as a web service to take an input and provide a tokenised output. The service itself is set up as a Jersey resource:

@Path("/")
public class MyResource
{

  public MyResource()
  {
    // Nothing here
  }

  @GET
  public String generate(@QueryParam("q") final String input)
  {
    try
    {
      final JCasIterable en = iteratePipeline(
        createReaderDescription(StringReader.class, StringReader.PARAM_DOCUMENT_TEXT, input, StringReader.PARAM_LANGUAGE, "en")
       ,createEngineDescription(StanfordSegmenter.class)
       ,createEngineDescription(StanfordPosTagger.class)
       ,createEngineDescription(StanfordParser.class)
       ,createEngineDescription(StanfordNamedEntityRecognizer.class)
      );

      final StringBuilder sb = new StringBuilder();
      for (final JCas jCas : en)
      {
        for (final Token token : select(jCas, Token.class))
        {
          sb.append('[');
          sb.append(token.getCoveredText());
          sb.append(' ');
          sb.append(token.getPos().getPosValue());
          sb.append(']');
        }
      }
      return sb.toString();
    }
    catch (final Exception e)
    {
      throw new RuntimeException("Problem", e);
    }
  }
}

Everything works but it is very slow, taking 7-10 seconds for each input. I assume that this is because the pipeline is being recreated for each request.

How can this code be reworked to move the pipeline creation to the constructor and reduce the load for individual requests? Note that there could be multiple simultaneous requests so anything that isn't thread-safe will need to be inside the request.

Recommended answer

Create a CAS:

JCas jcas = JCasFactory.createJCas();

Fill the CAS:

jcas.setDocumentText("This is a test");
jcas.setDocumentLanguage("en");

Create the pipeline once (and keep the engine around for further requests) using

AnalysisEngine engine = createEngine(
   createEngineDescription(...),
   createEngineDescription(...),
   ...);

If you let the engine be created implicitly on every request, it has to load its models and other resources over and over again.

Apply the pipeline to the CAS:

SimplePipeline.runPipeline(jcas, engine);
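
The "create once, reuse per request" idea can be sketched with plain Java. This is only an illustration under stated assumptions: FakeEngine and TokenService are hypothetical stand-ins (not DKPro Core or uimaFIT classes), and the constructor of FakeEngine plays the role of the expensive model loading. Because JAX-RS root resources are instantiated per request by default, the sketch keeps the engine in a static holder rather than an instance field, and it serializes access on the assumption that the engine is not thread-safe.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an analysis engine; construction simulates model loading.
final class FakeEngine {
    static final AtomicInteger LOADS = new AtomicInteger();

    FakeEngine() {
        LOADS.incrementAndGet(); // the expensive step that should happen only once
    }

    String process(String text) {
        return "[" + text + "]";
    }
}

// Hypothetical request handler: one engine per JVM, built lazily on first use.
final class TokenService {
    // Initialization-on-demand holder: the JVM creates ENGINE once, thread-safely.
    private static final class Holder {
        static final FakeEngine ENGINE = new FakeEngine();
    }

    String handle(String input) {
        FakeEngine engine = Holder.ENGINE;
        // Assumption: the engine is NOT thread-safe, so only one request uses it at a time.
        synchronized (engine) {
            return engine.process(input);
        }
    }
}
```

In the real service, the body of the synchronized block would be where a CAS is filled and passed to SimplePipeline.runPipeline. A single locked engine removes the per-request loading cost but processes requests one at a time.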

If you want to speed up processing further, create a pool of CASes and reuse them across requests, because creating a CAS from scratch takes a moment.

Some components may be thread-safe, others may not. This is largely up to the implementation of the underlying third-party library. In addition, the wrappers in DKPro Core are not explicitly built to be thread-safe. For example, in the default configuration, models are loaded and used depending on the document language. If you use the same analysis engine instance from multiple threads, this can cause problems.

Again, you should consider creating a pool of pre-instantiated pipelines. You would need quite a bit of memory though, because each instance loads its own models. There is some experimental functionality to share models between instances of the same component, but it has not been tested much. Mind that third-party tools may also have implemented their models in a non-thread-safe manner. For model sharing in DKPro Core, see this discussion on the mailing list.
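
A pool of pre-instantiated pipelines could look roughly like the following sketch, which uses only java.util.concurrent. ResourcePool and its methods are hypothetical names introduced for illustration; in a real service the type parameter would be the engine type, and the factory would build one engine per slot. Each borrowed instance is used by exactly one thread at a time, which sidesteps the thread-safety concerns described above at the cost of one set of models per instance.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical fixed-size pool; T would be an engine (or a CAS) in real code.
final class ResourcePool<T> {
    private final BlockingQueue<T> idle;

    ResourcePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the creation cost up front
        }
    }

    // Borrows an instance, runs the task with exclusive access, then returns it.
    <R> R with(Function<T, R> task) {
        T resource;
        try {
            resource = idle.take(); // blocks while every instance is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
        try {
            return task.apply(resource);
        } finally {
            idle.add(resource); // always hand the instance back, even on failure
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

A request handler would then wrap its work in pool.with(...). The same shape fits the pool of CASes mentioned earlier, provided each CAS is reset before it is reused. The pool size bounds both memory use and the number of requests processed concurrently.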

Disclosure: I am one of the DKPro Core developers.
