Sentence parsing is running extremely slowly


Problem description

I'm attempting to create a sentence parser that can read in a document and predict the correct points at which to break it into sentences, without splitting on unimportant periods such as the ones in "Dr." or ".NET", so I've been attempting to use CoreNLP.

Upon realizing that the PCFG parser was running far too slowly (and essentially bottlenecking my entire job), I attempted to switch to shift-reduce parsing (which, according to the CoreNLP website, is much faster).

However, the SRParser is running extremely slowly and I have no idea why: where the PCFG pipeline processes 1000 sentences per second, the SRParser manages only 100.

Here is the code for both. One thing that might be noteworthy is that each "document" has only about 10-20 sentences, so they're very small:

The PCFG parser:

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import scala.collection.JavaConverters._

class StanfordPCFGParser {
  // Note: this annotator list contains no "parse" step, so despite the class
  // name, no constituency parser ever runs in this pipeline.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  var i = 0
  val time = java.lang.System.currentTimeMillis()

  def parseSentence(doc: String): List[String] = {
    val tokens = new Annotation(doc)
    pipeline.annotate(tokens)
    val sentences = tokens.get(classOf[SentencesAnnotation]).asScala.toList
    sentences.foreach { _ =>
      if (i % 1000 == 0) println("parsed " + i + " in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i += 1
    }
    sentences.map(_.toString)
  }
}

The shift-reduce parser:

class StanfordShiftReduceParser {
  // (Same imports as the PCFG class above.)
  val p = new Properties()
  p.put("annotators", "tokenize, ssplit, pos, parse, lemma")
  // englishSR.ser.gz is the shift-reduce model, shipped separately from the
  // main CoreNLP models jar; it must be on the classpath.
  p.put("parse.model", "englishSR.ser.gz")
  val corenlp = new StanfordCoreNLP(p)
  var i = 0
  val time = java.lang.System.currentTimeMillis()

  def parseSentences(text: String): List[String] = {
    val annotation = new Annotation(text)
    corenlp.annotate(annotation)
    val sentences = annotation.get(classOf[SentencesAnnotation]).asScala.toList
    sentences.foreach { _ =>
      if (i % 1000 == 0) println("parsed " + i + " in " + (java.lang.System.currentTimeMillis() - time) / 1000 + " seconds")
      i += 1
    }
    sentences.map(_.toString)
  }
}
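
One thing worth noting when comparing these two classes: the "PCFG" pipeline above never actually loads a parser at all (its annotators stop at pos and lemma), while the shift-reduce pipeline runs a full parse step on every sentence, so the two timings do not measure the same work. A minimal sketch of what an apples-to-apples PCFG configuration would look like (the parse annotator loads the default englishPCFG model when parse.model is left unset):

val pcfgProps = new Properties()
pcfgProps.put("annotators", "tokenize, ssplit, pos, lemma, parse")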

Here is the code I used for the timing:

// getTime is not shown in the question; assume a simple wall-clock helper:
def getTime = java.lang.System.currentTimeMillis()

val originalParser = new StanfordPCFGParser
println("starting PCFG")
var time = getTime
sentences.foreach(originalParser.parseSentence)
time = getTime - time
println("PCFG parser took " + time.toDouble / 1000 + " seconds for 1000 documents to " + originalParser.i + " sentences")

val srParser = new StanfordShiftReduceParser
println("starting SRParse")
time = getTime
sentences.foreach(srParser.parseSentences)
time = getTime - time
println("SR parser took " + time.toDouble / 1000 + " seconds for 1000 documents to " + srParser.i + " sentences")

This gives me the following output (I've filtered out the "Untokenizable" warnings, which are caused by questionable data sources):

Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... starting PCFG
done [0.6 sec].
Adding annotator lemma
parsed 0 in 0 seconds
parsed 1000 in 1 seconds
parsed 2000 in 2 seconds
parsed 3000 in 3 seconds
parsed 4000 in 5 seconds
parsed 5000 in 5 seconds
parsed 6000 in 6 seconds
parsed 7000 in 7 seconds
parsed 8000 in 8 seconds
parsed 9000 in 9 seconds
PCFG parser took 10.158 seconds for 1000 documents to 9558 sentences
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator parse
Loading parser from serialized file englishSR.ser.gz ... done [8.3 sec].
starting SRParse
Adding annotator lemma
parsed 0 in 0 seconds
parsed 1000 in 17 seconds
parsed 2000 in 30 seconds
parsed 3000 in 43 seconds
parsed 4000 in 56 seconds
parsed 5000 in 66 seconds
parsed 6000 in 77 seconds
parsed 7000 in 90 seconds
parsed 8000 in 101 seconds
parsed 9000 in 113 seconds
SR parser took 120.506 seconds for 1000 documents to 9558 sentences

Any help would be greatly appreciated!

Answer

If all you need to do is split a block of text into sentences, you only need the tokenize and ssplit annotators; the parser is completely superfluous. So:

props.put("annotators", "tokenize, ssplit")
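
For reference, here is a minimal sketch of such a split-only pipeline, using the same CoreNLP API as the code in the question (the SentenceSplitter name and the example input are illustrative, not from the original post):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import scala.collection.JavaConverters._

object SentenceSplitter {
  // Only tokenize + ssplit: no tagger or parser model is loaded, so startup
  // and per-document cost stay low.
  private val props = new Properties()
  props.put("annotators", "tokenize, ssplit")
  private val pipeline = new StanfordCoreNLP(props)

  def split(text: String): List[String] = {
    val annotation = new Annotation(text)
    pipeline.annotate(annotation)
    annotation.get(classOf[SentencesAnnotation]).asScala.map(_.toString).toList
  }
}

// The tokenizer recognizes common abbreviations, so "Dr." does not end a
// sentence here; this should print two sentences, not three.
println(SentenceSplitter.split("Dr. Smith is here. He can help."))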

