How to parse languages other than English with Stanford Parser in Java, not from the command line?

Question

I have been trying to use Stanford Parser in my Java program to parse some Chinese sentences. Since I am new to both Java and Stanford Parser, I used 'ParserDemo.java' for practice. The code works fine with English sentences and outputs the right result. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some segmented Chinese sentences, things went wrong.

Here is my Java code:

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading and sentence-segment and tokenizing
    // a file using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String sent[] = { "我", "是", "一名", "学生" };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}

It's basically the same as ParserDemo.java, but when I run it I get the following result:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [2.2 sec].
(ROOT (IP (NP (PN 我)) (VP (VC 是) (NP (QP (CD 一名)) (NP (NN 学生))))))

Exception in thread "main" java.lang.RuntimeException: Failed to invoke public edu.stanford.nlp.trees.EnglishGrammaticalStructure(edu.stanford.nlp.trees.Tree)
    at edu.stanford.nlp.trees.GrammaticalStructureFactory.newGrammaticalStructure(GrammaticalStructureFactory.java:104)
    at parserdemo.ParserDemo.demoAPI(ParserDemo.java:65)
    at parserdemo.ParserDemo.main(ParserDemo.java:23)

The code at line 65 is:

 GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);

My guess is that chinesePCFG.ser.gz is missing something related to 'edu.stanford.nlp.trees.EnglishGrammaticalStructure'. Since the parser parses Chinese correctly from the command line, there must be something wrong with my own code. I have searched and found only a few similar cases, some of which mention using the right model, but I don't know how to modify the code to use the 'right model'. I hope someone can help. I am new to both Java and Stanford Parser, so please be specific. Thank you!

Recommended answer

The problem is that the GrammaticalStructureFactory is constructed from a PennTreebankLanguagePack, which is for the English Penn Treebank. That is why the factory tries to build an EnglishGrammaticalStructure from a Chinese tree and fails. You need to use the following instead (in two places, once in demoDP and once in demoAPI):

TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();

and add the matching import:

import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

We also generally recommend using the factored parser for Chinese: unlike for English, it works considerably better, although at the cost of more memory and time.

LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
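
Putting the two changes from the answer together, a corrected version of demoAPI might look roughly like the sketch below. It assumes the Stanford Parser jar and its Chinese models jar are on the classpath. The class name ChineseParserDemo is made up for illustration, and the token list is built from Word objects rather than with Sentence.toCoreLabelList as in the question; that is a substitution made here, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;
import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

// Hypothetical class name; a sketch of the fix, not the official demo.
class ChineseParserDemo {

  public static void main(String[] args) {
    // Factored Chinese model, as recommended in the answer.
    LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");

    // The parser expects Chinese input that is already word-segmented.
    String[] tokens = { "我", "是", "一名", "学生" };
    List<Word> words = new ArrayList<>();
    for (String token : tokens) {
      words.add(new Word(token));
    }

    Tree parse = lp.apply(words);
    parse.pennPrint();
    System.out.println();

    // The fix: use the Chinese language pack, not PennTreebankLanguagePack,
    // so the factory builds a Chinese grammatical structure.
    TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
  }

  private ChineseParserDemo() {} // static methods only
}
```

If your release provides it, lp.treebankLanguagePack() should return the language pack stored with the loaded model, which would avoid hard-coding the language when you switch models. Treat that as something to check against the Javadoc for your version.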
