Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?


Question

I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I created this rules file:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }

Here's a Scala REPL session showing how I'm trying to set up an annotation pipeline.

@ import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}

@ import edu.stanford.nlp.util.PropertiesUtils.asProperties

@ val pipe = new StanfordCoreNLP(asProperties(
  "customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
  "annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
  "ner.combinationMode", "HIGH_RECALL",
  "tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP@2ce6a051

@ val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland

@ pipe.annotate(doc)

@ doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]

@ doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]

As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?

Also, is there a way to tell from the document whether a tagged span came from two adjacent TokensRegex matches or from a single match? (Assume I have a more complicated set of regexes; in the current example I only ever match exactly 3 tokens, so I could just count tokens.)

I tried approaching it using regexner with a TokensRegex pattern, as described at https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't get that working.

Since I'm working in Scala, I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.

Recommended answer

Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. We are aiming to release Stanford CoreNLP 3.9.2 fairly soon, and it will include these changes.

This page gives an overview of the full NER pipeline run by the NERCombinerAnnotator:

https://stanfordnlp.github.io/CoreNLP/ner.html

There is also a detailed write-up on TokensRegex here:

https://stanfordnlp.github.io/CoreNLP/tokensregex.html

Basically, what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.

You can run this command:

java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
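Since the question uses the Java API from Scala, the same settings can probably be passed as pipeline properties instead of command-line flags. The following is a minimal sketch that only builds the properties; the object name `NerWithRules` and the method `props` are hypothetical, and it assumes that each property key is the command-line flag without its leading dash.

```scala
import java.util.Properties

// Hypothetical helper (not part of CoreNLP): builds pipeline properties
// that mirror the command-line flags shown above.
object NerWithRules {
  // rulesFile: path to a TokensRegex rules file, e.g. "my_ner.rules"
  def props(rulesFile: String): Properties = {
    val p = new Properties()
    // Same annotator list as the -annotators flag; no separate tokensregex annotator.
    p.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
    // Assumption: the key is the -ner.additional.tokensregex.rules flag minus the dash.
    p.setProperty("ner.additional.tokensregex.rules", rulesFile)
    p
  }
}
```

Passing `NerWithRules.props("addresses.tregx")` to `new StanfordCoreNLP(...)` in place of the properties in the question, and then annotating a `CoreDocument` as before, should be the API equivalent of the command, provided the CoreNLP build on the classpath contains the changes mentioned above.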

This command runs a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity mentions step runs, it operates on the rule-extracted named entities and creates entity mentions from them.
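To illustrate how mentions can be derived from token-level tags, here is a simplified sketch and not CoreNLP's actual implementation. It assumes a mention is a maximal run of consecutive tokens sharing the same non-"O" tag; the real entity mentions step may apply further compatibility checks. The names `MentionSketch`, `Mention`, and `group` are hypothetical.

```scala
// Simplified model of turning per-token NER tags into mentions.
// Assumption: a mention is a maximal run of consecutive tokens with the
// same non-"O" tag. This is an illustration, not CoreNLP's code.
object MentionSketch {
  // A mention covers tokens [start, end) and carries their shared tag.
  final case class Mention(tag: String, start: Int, end: Int)

  def group(tags: Seq[String]): List[Mention] = {
    val out = scala.collection.mutable.ListBuffer.empty[Mention]
    var i = 0
    while (i < tags.length) {
      val tag = tags(i)
      var j = i + 1
      // Extend the run while the next token carries the same tag.
      while (j < tags.length && tags(j) == tag) j += 1
      if (tag != "O") out += Mention(tag, i, j)
      i = j
    }
    out.toList
  }
}
```

Applied to the nerTags from the question, this model yields one three-token `address` mention. It also shows why the tags alone cannot separate two adjacent matches: six consecutive `address` tokens collapse into a single run under this assumption.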
