Which settings should be used for TokensRegexNER

Question

When I try regexner, it works as expected with the following settings and data:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");

Bachelor of Laws DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE

What I would like to do instead is use TokensRegex patterns. For example:

Bachelor of Laws DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}]) DEGREE

I read that to do this, I should use TokensRegexNERAnnotator.

I tried to use it as follows, but it did not work.

Pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));

I also tried setting up the annotator another way:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");    
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");

I tried different TokensRegex formats, but either the annotator could not find the expression or I got a SyntaxException.

What is the proper way to use TokensRegex (queries on tokens with tags) in an NER data file?

By the way, I just noticed a comment in TokensRegexNERAnnotator.java. I am not sure whether it is related to POS tags not working with the RegexNER annotator:

if (entry.tokensRegex != null) {
  // TODO: posTagPatterns...
  pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}

Recommended answer

First, you need to make a TokensRegex rules file (sample_degree.rules). Here is an example:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }

To explain the rule a bit: the pattern field specifies the token sequence to match. The action field says to annotate every token in the overall match (that is what $0 represents), setting the ner field (note that ner = ... is also defined at the top of the rules file) to the string "DEGREE", which is the third parameter.
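To make the effect of the action concrete, here is a plain-Java illustration of what the rule does to a token list. This is not CoreNLP code and not how TokensRegex is implemented: the Tok class, the applyDegreeRule method, and the hand-assigned POS tags are all hypothetical stand-ins. It only shows that matching "Bachelor of <NNP>" and annotating $0 labels all three matched tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DegreeRuleSketch {
    // Hypothetical stand-in for a CoreNLP token: word, POS tag, and NER label.
    static final class Tok {
        final String word;
        final String tag;
        String ner = "O";
        Tok(String word, String tag) { this.word = word; this.tag = tag; }
    }

    // Emulates the rule by hand:
    //   pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE")
    static void applyDegreeRule(List<Tok> toks) {
        for (int i = 0; i + 2 < toks.size(); i++) {
            boolean matches = toks.get(i).word.equals("Bachelor")
                    && toks.get(i + 1).word.equals("of")
                    && toks.get(i + 2).tag.equals("NNP");
            if (matches) {
                // $0 is the whole match, so every matched token gets the label.
                for (int j = i; j <= i + 2; j++) {
                    toks.get(j).ner = "DEGREE";
                }
                i += 2; // resume scanning after the matched span
            }
        }
    }

    public static void main(String[] args) {
        // POS tags here are assigned by hand for illustration only.
        List<Tok> toks = new ArrayList<>(Arrays.asList(
                new Tok("She", "PRP"), new Tok("holds", "VBZ"), new Tok("a", "DT"),
                new Tok("Bachelor", "NNP"), new Tok("of", "IN"), new Tok("Laws", "NNP")));
        applyDegreeRule(toks);
        for (Tok t : toks) {
            System.out.println(t.word + "\t" + t.tag + "\t" + t.ner);
        }
    }
}
```

In the real pipeline the tags come from the pos annotator, so whether a given word matches [{tag:NNP}] depends on the tagger's output rather than on hand-assigned values like these.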

Then make this .props file (degree_example.props) for the command:

customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator

tokensregex.rules = sample_degree.rules

annotators = tokenize,ssplit,pos,lemma,ner,tokensregex

Then run this command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text

You should see that the three tokens you wanted tagged as "DEGREE" will be tagged.

I think I will push a change to the code to make tokensregex link to the TokensRegexAnnotator so you won't have to specify it as a custom annotator. But for now you need to add that line in the .props file.
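Since the question configures its pipeline from Java with props.setProperty, the same three settings can also be placed on a java.util.Properties object instead of a .props file. A minimal sketch: the class name DegreePipelineProps and the method degreeProps are hypothetical, while the property keys and values are copied from degree_example.props. The pipeline construction is left as a comment so the block runs without CoreNLP on the classpath.

```java
import java.util.Properties;

public class DegreePipelineProps {
    // Builds the same settings as degree_example.props.
    // rulesPath is the location of the TokensRegex rules file.
    static Properties degreeProps(String rulesPath) {
        Properties props = new Properties();
        // Register TokensRegexAnnotator under the name "tokensregex".
        props.setProperty("customAnnotatorClass.tokensregex",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Point the annotator at the rules file.
        props.setProperty("tokensregex.rules", rulesPath);
        // Run tokensregex last, after pos has tagged the tokens.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        return props;
    }

    public static void main(String[] args) {
        Properties props = degreeProps("sample_degree.rules");
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
        // With CoreNLP on the classpath, the next step would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Keeping tokensregex at the end of the annotators list matters here because the rule tests {tag:NNP}, which is only available once the pos annotator has run.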

This example should help in implementing this. Here are some more resources if you want to learn more:

http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html
