TreebankLanguagePack function in Neural Network Dependency Parser

Question

If I want to train the Stanford Neural Network Dependency Parser for another language, I need a "treebankLanguagePack" (TLP), but the information about this TLP is very limited:

particularities of your treebank and the language it contains

My "treebank" is in another language but follows the same format as the PTB, and my data is in CoNLL format. The dependency annotation follows Universal Dependencies (UD). Do I need this TLP?

Recommended answer

As of the current CoreNLP release, the TreebankLanguagePack is used within the dependency parser only to 1) determine the input text encoding and 2) determine which tokens count as punctuation [1].

Your best bet for a quick solution, then, is probably to stick with the UD English TreebankLanguagePack. You should do this by specifying the property language as "UniversalEnglish" (whether you're accessing the dependency parser via code or command line). If you're using the dependency parser via the CoreNLP main entry point, this property key should be depparse.language.
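As a rough illustration of the two property keys mentioned above, here is a minimal sketch that only builds `java.util.Properties` objects. The class and method names are hypothetical, and the `annotators` list is an assumption about a typical pipeline. Only the `language` and `depparse.language` keys and the `UniversalEnglish` value come from the answer itself.

```java
import java.util.Properties;

// Hypothetical helper class; it only assembles property sets and does not call CoreNLP.
public class DepparseLanguageProps {

    // Properties for driving the dependency parser directly (code or command line).
    static Properties forStandaloneParser() {
        Properties props = new Properties();
        props.setProperty("language", "UniversalEnglish");
        return props;
    }

    // Properties for the CoreNLP main entry point, where the key is prefixed with "depparse.".
    static Properties forCoreNlpPipeline() {
        Properties props = new Properties();
        // Assumed annotator list; adjust it to whatever your pipeline actually runs.
        props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
        props.setProperty("depparse.language", "UniversalEnglish");
        return props;
    }

    public static void main(String[] args) {
        System.out.println("language = "
                + forStandaloneParser().getProperty("language"));
        System.out.println("depparse.language = "
                + forCoreNlpPipeline().getProperty("depparse.language"));
    }
}
```

With CoreNLP on the classpath, the second property set would typically be handed to the pipeline constructor, and the first to whatever code configures the parser on its own.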

Two very subtle details follow. You probably don't need to worry about these if you're just trying to hack something together at first, but it's probably good to mention so that you can avoid apocalyptic / head-smashing bugs in the future.

  • Evaluation and punctuation: If you do choose to stick with UniversalEnglish, be aware that there is a hack in the evaluation code that overrides the punctuation set for English parsing in particular. Any changes you make to punctuation in PennTreebankLanguagePack (the TLP used for the UniversalEnglish language) will be ignored! If you need to get around this, it should be enough to copy and paste the PennTreebankLanguagePack into your own codebase and name it something different.
  • Potential memory leak: When building parse results to be returned to the user, the dependency parser draws from a pool of cached GrammaticalRelation objects. This cache does not live-update. This means that if you have relations which aren't formally defined in the language you specified via the language property, they will lead to the instantiation of a new object whenever those relations show up in parser predictions. (This can be a big deal memory-wise if you happen to store the parse objects somewhere.)
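The memory issue in the second point can be pictured with a toy model. This is not CoreNLP's code: `Relation` and `RelationCacheSketch` are hypothetical stand-ins, and the sketch simply assumes a pool that is filled once and never updated afterwards.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a relation pool that does not live-update (hypothetical, not CoreNLP code).
public class RelationCacheSketch {

    // Hypothetical stand-in for a grammatical relation object.
    static final class Relation {
        final String name;
        Relation(String name) { this.name = name; }
    }

    // Pool built once from the relations the chosen language formally defines.
    private final Map<String, Relation> pool = new HashMap<>();

    // Counts objects created outside the pool.
    int allocations = 0;

    RelationCacheSketch(String... definedRelations) {
        for (String name : definedRelations) {
            pool.put(name, new Relation(name));
        }
    }

    // Returns the pooled object when the relation is defined; otherwise
    // allocates a fresh object on every call, because the pool is never extended.
    Relation lookup(String name) {
        Relation cached = pool.get(name);
        if (cached != null) {
            return cached;
        }
        allocations++;
        return new Relation(name);
    }

    public static void main(String[] args) {
        RelationCacheSketch cache = new RelationCacheSketch("nsubj", "dobj");
        for (int i = 0; i < 1000; i++) {
            cache.lookup("nsubj");          // reuses the pooled object
            cache.lookup("my-custom-rel");  // creates a new object each time
        }
        System.out.println("extra allocations: " + cache.allocations);  // prints "extra allocations: 1000"
    }
}
```

If the returned objects are stored alongside the parses, each undefined relation keeps one extra object alive per occurrence, which is where the memory growth would come from.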

[1]: Punctuation is excluded during evaluation. This is a standard "cheat" used throughout the dependency parsing literature.
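To make the footnote concrete, here is a small sketch of an unlabeled attachment score that skips punctuation tokens. It assumes punctuation is identified by POS tag and uses a made-up four-token sentence. The class name, method name, and tag set are all illustrative and not taken from CoreNLP.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative scorer; not the evaluation code used by CoreNLP.
public class AttachmentScoreSketch {

    // Fraction of non-punctuation tokens whose predicted head matches the gold head.
    static double uasIgnoringPunctuation(String[] posTags, int[] goldHeads,
                                         int[] predHeads, Set<String> punctTags) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < posTags.length; i++) {
            if (punctTags.contains(posTags[i])) {
                continue;  // punctuation tokens are left out of the score entirely
            }
            total++;
            if (goldHeads[i] == predHeads[i]) {
                correct++;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up sentence: three words plus a final period; heads are 1-based, 0 = root.
        String[] tags = {"DT", "NN", "VBZ", "."};
        int[] gold = {2, 3, 0, 3};
        int[] pred = {2, 3, 0, 2};  // only the period is attached incorrectly

        Set<String> punct = new HashSet<>(Arrays.asList(".", ",", ":"));
        Set<String> none = new HashSet<>();

        System.out.println(uasIgnoringPunctuation(tags, gold, pred, punct));  // prints 1.0
        System.out.println(uasIgnoringPunctuation(tags, gold, pred, none));   // prints 0.75
    }
}
```

The gap between the two numbers shows why the choice of punctuation set matters when comparing scores across setups.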
