Lazy parsing with Stanford CoreNLP to get sentiment only of specific sentences

Question

I am looking for a way to optimize the performance of the Stanford CoreNLP sentiment pipeline. I want to get the sentiment of sentences, but only of those containing specific keywords given as input.

I tried two approaches:

Approach 1: StanfordCoreNLP pipeline annotating the entire text with sentiment

I defined a pipeline of annotators: tokenize, ssplit, parse, sentiment. I ran it on the entire article, then looked for keywords in each sentence and, if they were present, ran a method returning the keyword value. I was not satisfied, though, that processing took a couple of seconds.

Here is the code:

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key by sentence index, matching Map<Integer,Integer>
    }
}

Approach 2: StanfordCoreNLP pipeline annotating the entire text with sentence splitting, separate annotators run on the sentences of interest

Because of the weak performance of the first solution, I defined a second one: a pipeline with only the tokenize and ssplit annotators. I looked for keywords in each sentence and, if they were present, created an annotation for that sentence alone and ran the next annotators on it: ParserAnnotator, BinarizerAnnotator and SentimentAnnotator.

The results were really unsatisfying because of the ParserAnnotator, even though I initialized it with the same properties. Sometimes it took even more time than the entire pipeline run on a document in Approach 1.

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parsing, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// initialization of the annotators to be run on individual sentences
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // code required to perform annotation on one sentence
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation  = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);
        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);

        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key by sentence index, matching Map<Integer,Integer>
    }
}

Questions

  1. I wonder why parsing in CoreNLP is not "lazy"? (In my example, that would mean: performed only when the sentiment of a sentence is requested.) Is it for performance reasons?

  2. How come a parser for one sentence can take almost as long as a parser for an entire article (my article had 7 sentences)? Is it possible to configure it so that it works faster?


Answer

If you're looking to speed up constituency parsing, the single best improvement is to use the new shift-reduce constituency parser. It is orders of magnitude faster than the default PCFG parser.
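
A minimal configuration sketch (assuming the separate shift-reduce models jar, which is not bundled with the default CoreNLP distribution, is on the classpath):

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
// assumption: englishSR.ser.gz comes from the separate SR-parser models jar
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
props.setProperty("parse.binaryTrees", "true"); // sentiment works on binarized trees
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);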

Answers to your later questions:


  1. Why is CoreNLP parsing not lazy? This is certainly possible, but not something that we've implemented yet in the pipeline. We likely haven't seen many use cases in-house where this is necessary. We will happily accept a contribution of a "lazy annotator wrapper" if you're interested in making one!

  2. How come a parser for one sentence can take almost as long as a parser for an entire article? The default Stanford PCFG parser has cubic time complexity with respect to sentence length, which is why we usually recommend restricting the maximum sentence length for performance reasons. The shift-reduce parser, on the other hand, runs in linear time with respect to sentence length.
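
Putting the two together, a sketch of Approach 2 with the shift-reduce model loaded into the per-sentence parser (again assuming the SR models jar is available):

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
props.setProperty("parse.maxlen", "20");
// assumption: shift-reduce model from the separate SR models jar
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
// the annotator name doubles as the property prefix, so the name "parse"
// lets the annotator pick up the parse.* properties set above
ParserAnnotator parserAnnotator = new ParserAnnotator("parse", props);

With the linear-time shift-reduce model behind the same ParserAnnotator interface, the per-sentence annotate calls should no longer dominate the runtime.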
