Which settings should be used for TokensRegexNER
Question
When I try regexner, it works as expected with the following settings and data:
props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");
Bachelor of Laws DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE
What I would like to do is use TokensRegex instead. For example:
Bachelor of Laws DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}]) DEGREE
I read that to do this I should use TokensRegexNERAnnotator.
I tried to use it as follows, but it did not work.
Pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));
Or I tried setting up the annotator in another way:
props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");
I tried different TokensRegex formats, but either the annotator could not find the expression or I got a SyntaxException.
What is the proper way to use TokensRegex (queries on tokens with tags) in an NER data file?
By the way, I just saw a comment in the TokensRegexNERAnnotator.java file. I am not sure whether it is related to POS tags not working with RegexNERAnnotator.
if (entry.tokensRegex != null) {
  // TODO: posTagPatterns...
  pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}
Recommended answer
First you need to make a TokensRegex rule file (sample_degree.rules). Here is an example:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }
To explain the rule a bit: the pattern field specifies the token sequence to match. The action field says to annotate every token in the overall match (that is what $0 represents), setting the ner field (note that ner = ... is defined at the top of the rule file) to the String "DEGREE", which is the third parameter.
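To make the effect of the rule concrete, here is a minimal plain-Java sketch of what it does to a token list. This is not CoreNLP code and not how TokensRegex is implemented: the Token class, the applyDegreeRule method, and the hand-assigned POS tags in the sample sentence are all hypothetical stand-ins, used only to show that every token covered by $0 receives the "DEGREE" label.

```java
import java.util.ArrayList;
import java.util.List;

public class DegreeRuleSketch {

    /** Hypothetical stand-in for a CoreNLP token: word, POS tag, NER label. */
    public static final class Token {
        public final String word;
        public final String tag;
        public String ner = "O";

        public Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * Mimics the rule
     *   pattern: (/Bachelor/ /of/ [{tag:NNP}])
     *   action:  Annotate($0, ner, "DEGREE")
     * by scanning for the three-token sequence and labelling the whole match.
     */
    public static void applyDegreeRule(List<Token> tokens) {
        for (int i = 0; i + 2 < tokens.size(); i++) {
            boolean matches = tokens.get(i).word.equals("Bachelor")
                    && tokens.get(i + 1).word.equals("of")
                    && tokens.get(i + 2).tag.equals("NNP");
            if (matches) {
                // $0 is the overall match, so all three tokens get the label.
                for (int j = i; j <= i + 2; j++) {
                    tokens.get(j).ner = "DEGREE";
                }
                i += 2; // resume scanning after the matched span
            }
        }
    }

    /** Sample sentence; the POS tags are assigned by hand (an assumption). */
    public static List<Token> sample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("She", "PRP"));
        tokens.add(new Token("holds", "VBZ"));
        tokens.add(new Token("a", "DT"));
        tokens.add(new Token("Bachelor", "NNP"));
        tokens.add(new Token("of", "IN"));
        tokens.add(new Token("Laws", "NNP"));
        tokens.add(new Token(".", "."));
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> tokens = sample();
        applyDegreeRule(tokens);
        for (Token t : tokens) {
            System.out.println(t.word + "/" + t.tag + "/" + t.ner);
        }
    }
}
```

Note that the match depends on the tag the POS tagger actually assigns to the last token; if the tagger produces something other than NNP, the pattern in the rule file would need to allow that tag as well.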
Then make this .props file (degree_example.props) for the command:
customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator
tokensregex.rules = sample_degree.rules
annotators = tokenize,ssplit,pos,lemma,ner,tokensregex
Then run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text
You should see that the three tokens you wanted tagged as "DEGREE" will be tagged.
I think I will push a change to the code to make tokensregex link to the TokensRegexAnnotator so you won't have to specify it as a custom annotator. But for now you need to add that line to the .props file.
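If you build the pipeline in code rather than from the command line, the same three settings can be placed in a java.util.Properties object. The sketch below uses only the JDK; the class name DegreePipelineProps is a hypothetical name for illustration, and the rule file path assumes sample_degree.rules is in the working directory.

```java
import java.util.Properties;

public class DegreePipelineProps {

    /** Builds the same settings as degree_example.props. */
    public static Properties build() {
        Properties props = new Properties();
        // Register TokensRegexAnnotator under the name "tokensregex".
        props.setProperty("customAnnotatorClass.tokensregex",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Rule file read by the custom annotator (path is an assumption).
        props.setProperty("tokensregex.rules", "sample_degree.rules");
        // tokensregex runs last so that POS tags are available to the rule.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP on the classpath, these properties would be passed to
        // the StanfordCoreNLP constructor: new StanfordCoreNLP(props).
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```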
This example should help in implementing this. Here are some more resources if you want to learn more:
http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html