Execution time of Stanford CoreNLP on other languages

Problem description

I need to extract sentences, tokens, POS tags, and lemmas from English and German texts in a large corpus, so I used the Stanford CoreNLP tool. Its output is perfect; the problem is the running time. The English pipeline runs quickly, but the German one takes a long time to annotate the text. I initialize the pipelines with the following code:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// To initialize the English model
Properties propsEN = new Properties();
propsEN.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsEN.setProperty("tokenize.language", "en");
StanfordCoreNLP corenlpEN = new StanfordCoreNLP(propsEN);

// To initialize the German model
Properties propsDE = new Properties();
propsDE.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsDE.setProperty("tokenize.language", "de");
StanfordCoreNLP corenlpDE = new StanfordCoreNLP(propsDE);

To show the difference in execution times, I recorded the length of each text and the time each pipeline took to annotate it, measured in milliseconds with System.currentTimeMillis():

English text length=1587 --- Elapsed time=57 ms
English text length=15906 --- Elapsed time=160 ms
English text length=44286 --- Elapsed time=3287 ms
English text length=19814 --- Elapsed time=1809 ms
English text length=1427 --- Elapsed time=166 ms
English text length=56787 --- Elapsed time=2374 ms

Running the Stanford CoreNLP model on the German texts:

German text length=979 --- Elapsed time=401 ms
German text length=22039 --- Elapsed time=15285 ms
German text length=30632 --- Elapsed time=21659 ms
German text length=42019 --- Elapsed time=21767 ms
German text length=2944 --- Elapsed time=2005 ms
German text length=76248 --- Elapsed time=48857 ms
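To put a single number on the gap, the measurements listed above can be summarized as throughput (total length divided by total elapsed milliseconds). The following is only a small sketch: the class and method names are made up for illustration, and "length" is used exactly as reported, since the post does not say whether it counts characters or tokens.

```java
import java.util.Locale;

// Hypothetical helper (not part of CoreNLP): summarizes the reported timings.
class ThroughputSummary {

    // Measurements copied from the post; times come from System.currentTimeMillis().
    static final long[] EN_LENGTHS = {1587, 15906, 44286, 19814, 1427, 56787};
    static final long[] EN_MS = {57, 160, 3287, 1809, 166, 2374};
    static final long[] DE_LENGTHS = {979, 22039, 30632, 42019, 2944, 76248};
    static final long[] DE_MS = {401, 15285, 21659, 21767, 2005, 48857};

    // Total length divided by total elapsed milliseconds.
    static double unitsPerMs(long[] lengths, long[] elapsedMs) {
        if (lengths.length != elapsedMs.length) {
            throw new IllegalArgumentException("arrays must have the same size");
        }
        long totalLength = 0;
        long totalMs = 0;
        for (int i = 0; i < lengths.length; i++) {
            totalLength += lengths[i];
            totalMs += elapsedMs[i];
        }
        return (double) totalLength / totalMs;
    }

    public static void main(String[] args) {
        double en = unitsPerMs(EN_LENGTHS, EN_MS);
        double de = unitsPerMs(DE_LENGTHS, DE_MS);
        System.out.printf(Locale.ROOT, "English: %.2f per ms%n", en);
        System.out.printf(Locale.ROOT, "German:  %.2f per ms%n", de);
        System.out.printf(Locale.ROOT, "English/German ratio: %.1f%n", en / de);
    }
}
```

On these twelve measurements the English pipeline handles roughly 17.8 length units per millisecond and the German one roughly 1.6, a gap of about 11 times. Aggregating over all texts smooths out the noise of the individual short runs.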

Why does the German model take several times longer? Have I made a mistake? Is there a way to fix this?

Any information on this topic is appreciated.

Recommended answer

I don't know if this will help, but you're not using the German part-of-speech tagger. You can set it with the pos.model property.

Here is a list of options (make sure you have the German models jar):

edu/stanford/nlp/models/pos-tagger/german/german-fast.tagger
edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
edu/stanford/nlp/models/pos-tagger/german/german-fast-caseless.tagger
edu/stanford/nlp/models/pos-tagger/german/german-ud.tagger

Also, there is no German lemmatizer.
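Putting the two points of this answer together, the German configuration might look like the sketch below. It only builds the java.util.Properties object, so it runs without CoreNLP on the classpath; the class and method names are hypothetical, and german-fast.tagger is picked from the list above purely as an example. Whether this actually removes the slowdown, and which tagger is quickest on a given corpus, would have to be measured.

```java
import java.util.Properties;

// Hypothetical helper: builds German pipeline properties with an explicit pos.model.
class GermanPipelineProps {

    // One of the tagger paths listed in the answer; the others can be swapped in.
    static final String GERMAN_FAST_TAGGER =
            "edu/stanford/nlp/models/pos-tagger/german/german-fast.tagger";

    static Properties germanProps(String posModelPath) {
        Properties props = new Properties();
        // "lemma" is left out because the answer notes there is no German lemmatizer.
        props.setProperty("annotators", "tokenize, ssplit, pos");
        props.setProperty("tokenize.language", "de");
        // Point the POS tagger at a German model explicitly.
        props.setProperty("pos.model", posModelPath);
        return props;
    }

    public static void main(String[] args) {
        Properties propsDE = germanProps(GERMAN_FAST_TAGGER);
        propsDE.forEach((key, value) -> System.out.println(key + " = " + value));
        // With CoreNLP and the German models jar on the classpath, these
        // properties would be passed to: new StanfordCoreNLP(propsDE)
    }
}
```

Keeping the model path as a parameter makes it easy to time each of the listed taggers on the same texts and compare them.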
