Execution time of Stanford CoreNLP on other languages
Question
I need to extract sentences, tokens, POS tags and lemmas from the English and German texts of a large corpus, so I used the Stanford CoreNLP tool. Its output is perfect. The problem is the running time: the English pipeline executes quickly, but the German one takes a long time to annotate the text. I initialize the models with the following code:
// To initialize English model
propsEN = new Properties();
propsEN.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsEN.setProperty("tokenize.language", "en");
corenlpEN = new StanfordCoreNLP(propsEN);
// To initialize German model
propsDE = new Properties();
propsDE.setProperty("annotators", "tokenize, ssplit, pos, lemma");
propsDE.setProperty("tokenize.language", "de");
corenlpDE = new StanfordCoreNLP(propsDE);
To show the difference in execution times, I recorded the length of each text and the time each model took to process it. I measured the execution time with System.currentTimeMillis():
English text length=1587 --- Elapse time=57
English text length=15906 --- Elapse time=160
English text length=44286 --- Elapse time=3287
English text length=19814 --- Elapse time=1809
English text length=1427 --- Elapse time=166
English text length=56787 --- Elapse time=2374
Running the Stanford CoreNLP model on the German texts:
German text length=979 --- Elapse time=401
German text length=22039 --- Elapse time=15285
German text length=30632 --- Elapse time=21659
German text length=42019 --- Elapse time=21767
German text length=2944 --- Elapse time=2005
German text length=76248 --- Elapse time=48857
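To put the two sets of timings side by side, the sketch below adds up the lengths and elapsed times listed above and reports characters processed per millisecond for each language. The class and method names (ThroughputSketch, overall, charsPerMilli) are made up for illustration, and the sketch assumes the elapsed times are in milliseconds, as System.currentTimeMillis() would give.

```java
import java.util.Locale;

public class ThroughputSketch {
    // {text length, elapsed ms} pairs copied from the measurements in the question.
    static final long[][] ENGLISH = {
        {1587, 57}, {15906, 160}, {44286, 3287}, {19814, 1809}, {1427, 166}, {56787, 2374}
    };
    static final long[][] GERMAN = {
        {979, 401}, {22039, 15285}, {30632, 21659}, {42019, 21767}, {2944, 2005}, {76248, 48857}
    };

    // Characters processed per millisecond for a single run.
    static double charsPerMilli(long textLength, long elapsedMillis) {
        return (double) textLength / elapsedMillis;
    }

    // Total characters divided by total elapsed time over all runs.
    static double overall(long[][] runs) {
        long chars = 0;
        long millis = 0;
        for (long[] run : runs) {
            chars += run[0];
            millis += run[1];
        }
        return charsPerMilli(chars, millis);
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "English: %.1f chars/ms%n", overall(ENGLISH));
        System.out.printf(Locale.ROOT, "German:  %.1f chars/ms%n", overall(GERMAN));
    }
}
```

On these numbers the English pipeline comes out at about 17.8 characters per millisecond and the German one at about 1.6, so the gap is roughly a factor of ten rather than a constant start-up cost.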
Why does the German model take several times longer? Have I made a mistake? Is there a way to solve the problem?
Any information about this topic is appreciated.
Recommended answer
I don't know if this will help, but you're not using the German part-of-speech tagger. You can set that with the pos.model property.
Here is a list of options (make sure you have the German models jar):
edu/stanford/nlp/models/pos-tagger/german/german-fast.tagger
edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
edu/stanford/nlp/models/pos-tagger/german/german-fast-caseless.tagger
edu/stanford/nlp/models/pos-tagger/german/german-ud.tagger
Also, there is no lemma for German.
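As a rough sketch of what that could look like, the snippet below builds the Properties for a German pipeline with pos.model pointing at one of the taggers listed above and with lemma dropped from the annotator list. The class and method names (GermanPipelineProps, germanProps) are hypothetical, the choice of german-fast.tagger is only an example, and whether this actually shortens the running time would need to be measured.

```java
import java.util.Properties;

public class GermanPipelineProps {
    // Builds settings for a German pipeline. The tagger path is one of the
    // options listed in the answer; picking german-fast.tagger is an assumption.
    static Properties germanProps() {
        Properties props = new Properties();
        // "lemma" is omitted because the answer notes there is no lemma for German.
        props.setProperty("annotators", "tokenize, ssplit, pos");
        props.setProperty("tokenize.language", "de");
        props.setProperty("pos.model",
                "edu/stanford/nlp/models/pos-tagger/german/german-fast.tagger");
        return props;
    }

    public static void main(String[] args) {
        Properties propsDE = germanProps();
        // With CoreNLP and the German models jar on the classpath, the pipeline
        // would then be created as in the question:
        //   corenlpDE = new StanfordCoreNLP(propsDE);
        propsDE.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Only java.util.Properties is used here, so the sketch compiles without CoreNLP; the pipeline construction is left as a comment because it needs the library and the German models jar.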