How to parse languages other than English with Stanford Parser in Java (not from the command line)?

Question

I have been trying to use Stanford Parser in my Java program to parse some Chinese sentences. Since I am quite new to both Java and Stanford Parser, I used 'ParserDemo.java' to practice. The code works fine with English sentences and outputs the right result. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some segmented Chinese sentences, things went wrong.

Here is my Java code:

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading a file, splitting it into sentences,
    // and tokenizing it using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String sent[] = { "我", "是", "一名", "学生" };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}

It is basically the same as ParserDemo.java, but when I run it I get the following output:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [2.2 sec].
(ROOT (IP (NP (PN 我)) (VP (VC 是) (NP (QP (CD 一名)) (NP (NN 学生))))))

Exception in thread "main" java.lang.RuntimeException: Failed to invoke public edu.stanford.nlp.trees.EnglishGrammaticalStructure(edu.stanford.nlp.trees.Tree)
    at edu.stanford.nlp.trees.GrammaticalStructureFactory.newGrammaticalStructure(GrammaticalStructureFactory.java:104)
    at parserdemo.ParserDemo.demoAPI(ParserDemo.java:65)
    at parserdemo.ParserDemo.main(ParserDemo.java:23)

The code on line 65 is:

 GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);

My guess is that chinesePCFG.ser.gz is missing something related to 'edu.stanford.nlp.trees.EnglishGrammaticalStructure'. Since the parser parses Chinese correctly from the command line, there must be something wrong with my own code. I have been searching, but found only a few similar cases. Some of them mention using the right model, but I don't really know how to modify the code to use the 'right model'. I hope someone can help me with this. I am a newbie to Java and Stanford Parser, so please be specific. Thank you!

Recommended answer

The problem is that the GrammaticalStructureFactory is constructed from a PennTreebankLanguagePack, which is for the English Penn Treebank. That is why the factory tries to build an EnglishGrammaticalStructure from a Chinese tree and fails. You need to use the following instead (in both places where the language pack is created):

TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();

and add the corresponding import:

import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

We also generally recommend using the factored parser for Chinese. Unlike for English, it works considerably better, although at the cost of more memory and time:

LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
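Putting the two changes together, the demoAPI path from the question might look like the sketch below. It is only an illustration: the class name ChineseParserDemo and the helper chineseDependencies are made up for this example, and it assumes the Stanford Parser jar and the Chinese models are on the classpath. It keeps the Sentence helper used in the question; newer releases of the library renamed that class to SentenceUtils.

```java
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;
import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

// Hypothetical class name, used only for this sketch.
class ChineseParserDemo {

  // Hypothetical helper: extracts typed dependencies from a Chinese parse
  // tree using the Chinese language pack instead of the Penn (English) one.
  static List<TypedDependency> chineseDependencies(Tree parse) {
    TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    return gs.typedDependenciesCCprocessed();
  }

  public static void main(String[] args) {
    // Factored Chinese model, as recommended in the answer.
    LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");

    // The input must already be segmented into words.
    String[] sent = { "我", "是", "一名", "学生" };
    // In newer library versions this helper lives in SentenceUtils.
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);

    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println(chineseDependencies(parse));
  }
}
```

The same replacement of PennTreebankLanguagePack applies to demoDP in the question; only the language pack and the model path change, the rest of the parsing code stays as it was.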
