Annotate author names using REGEXNER from the stanfordnlp library


Question

My goal is to annotate author names in scientific articles with the entity PERSON. I am particularly interested in names that match the format (authorname et al. date). For example, in the sentence (Minot et al. 2000) I would like Minot to be annotated as PERSON. I am using an adapted version of the code found on the official page of the Stanford NLP team:

import stanfordnlp
from stanfordnlp.server import CoreNLPClient

# example text
text = "In practice, its scope is broad and includes the analysis of a diverse set of samples such as gut microbiome (Qin et al., 2010), (Minot et al., 2011), environmental (Mizuno et al., 2013) or clinical (Willner et al., 2009), (Negredo et al., 2011), (McMullan et al., 2012) samples."
print('---')
print('input text')
print('')
print(text)

# properties dictionary: regexner runs after ner and reads rules from rgxrules.txt
prop = {'regexner.mapping': 'rgxrules.txt', 'annotators': 'tokenize,ssplit,pos,lemma,ner,regexner'}

# set up the client
print('---')
print('starting up Java Stanford CoreNLP Server...')
with CoreNLPClient(properties=prop, timeout=100000, memory='16G', be_quiet=False) as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]
    # print each token together with its NER tag
    for token in sentence.token:
        print(token.word, token.ner)

After running the code I get a false negative and a false positive: Negredo is annotated as O rather than PERSON, and Minot is annotated as CITY because it is the name of an American city, although in this sentence it should be annotated as an author name.

My attempt to solve this problem was to add a rule to the rgxrules.txt file that I pass to the CoreNLPClient. Here is the line that I have in this file:

[[A-Z][a-z]] /et/ /al\./\tPERSON

This does not solve the problem, as you can check by running the code. I also don't know how to express that only the word that matches '[[A-Z][a-z]]' and comes directly before et al. should be annotated with PERSON, rather than the whole phrase 'Minot et al.', for example.

Any ideas on how I can solve this problem?

Thanks.

Recommended answer

In terms of matching Java regular expressions, I'm pretty sure you want something like

[A-Za-z]+ et al[.]
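
To see what a token-by-token pattern like this would pick up, here is a rough emulation in plain Python. It is only a sketch and does not call CoreNLP: it assumes each whitespace-separated piece of the pattern must match one whole token, and it assumes the tokenizer splits "(Minot et al., 2011)" as shown. The helper name find_matches is made up for illustration, and Python's re module is not identical to Java's regex engine.

```python
import re

def find_matches(tokens, pattern):
    """Return (start, end) token spans where each whitespace-separated
    sub-pattern fully matches the corresponding token."""
    parts = [re.compile(p) for p in pattern.split()]
    spans = []
    for i in range(len(tokens) - len(parts) + 1):
        window = tokens[i:i + len(parts)]
        if all(rx.fullmatch(tok) for rx, tok in zip(parts, window)):
            spans.append((i, i + len(parts)))
    return spans

# Assumed tokenization; the real CoreNLP tokenizer may split differently.
tokens = ["(", "Minot", "et", "al.", ",", "2011", ")"]
print(find_matches(tokens, "[A-Za-z]+ et al[.]"))
```

The match covers all three tokens (Minot, et, al.), so a rule built on this pattern would label the whole span, not only the surname.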

However, I don't know of any way to avoid labeling et al. itself, such as a token lookahead. What happens if you then add another line to the regex file that relabels et al. as O? You would probably need to declare that PERSON is an allowable type to overwrite with O.
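
If relabeling inside the rules file turns out not to work, one possible fallback is to fix the tags in Python after the server returns. This is a different technique from RegexNER, shown only as a sketch: the function name relabel_et_al is hypothetical, and the token and tag lists are assumed to have been collected from token.word and token.ner of the returned sentence.

```python
def relabel_et_al(tokens, tags):
    """Tag the capitalized token right before "et al." as PERSON
    and reset "et" and "al." themselves to O."""
    new_tags = list(tags)
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "et" and tokens[i + 1] in ("al.", "al"):
            if tokens[i - 1][:1].isupper():
                new_tags[i - 1] = "PERSON"
            new_tags[i] = new_tags[i + 1] = "O"
    return new_tags

# Assumed tokens and tags for "(Minot et al., 2011)"; real output may differ.
tokens = ["(", "Minot", "et", "al.", ",", "2011", ")"]
tags = ["O", "CITY", "O", "O", "O", "DATE", "O"]
print(list(zip(tokens, relabel_et_al(tokens, tags))))
```

The same function also cleans up the case where a RegexNER rule has tagged all three tokens as PERSON, since it resets et and al. to O regardless of their previous tag.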
