Extract Noun Phrases with Stanza and CoreNLPClient
Question
I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done with the CoreNLPClient module in Stanza.
# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')
Here is an example sentence. I am using the tregex function in the client to get all the noun phrases. The tregex function returns a dict of dicts in Python, so I needed to process its output before passing it to the Tree.fromstring function in NLTK to correctly extract the noun phrases as strings.
pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
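To make the "dict of dicts" concrete, here is a minimal sketch of the shape such a result takes and one way to walk it. The sample is hand-written for illustration, not captured from a running server, and it keeps only the match and spanString fields; a real response carries additional fields. The helper name iter_matches is hypothetical.

```python
# Hand-written sample mimicking the shape of a tregex result.
# This is an illustration, not real server output.
sample_matches = {
    "sentences": [
        {   # one dict per sentence, keyed by match index
            "0": {"match": "(NP (NNP Albert) (NNP Einstein))\n",
                  "spanString": "Albert Einstein"},
            "1": {"match": "(NP (DT a) (JJ German-born) (JJ theoretical) (NN physicist))\n",
                  "spanString": "a German-born theoretical physicist"},
        },
        {
            "0": {"match": "(NP (PRP He))\n", "spanString": "He"},
        },
    ]
}


def iter_matches(matches):
    """Yield (sentence_index, match_id, match_dict) for every match."""
    for sent_idx, sentence in enumerate(matches["sentences"]):
        for match_id, match in sentence.items():
            yield sent_idx, match_id, match


for sent_idx, match_id, match in iter_matches(sample_matches):
    print(sent_idx, match_id, match["spanString"])
```

Note that the outer dict has a single key, sentences, so any loop over the outer dict itself only ever runs once.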
Hence, I came up with the method stanza_phrases, which loops through the dict of dicts returned by tregex and formats each match for Tree.fromstring in NLTK.
def stanza_phrases(matches):
    Nps = []
    for match in matches:
        for items in matches['sentences']:
            for keys, values in items.items():
                s = '(ROOT\n' + values['match'] + ')'
                Nps.extend(extract_phrase(s, pattern))
    return set(Nps)
The extract_phrase method generates a tree to be used by NLTK:
from nltk.tree import Tree

def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                t = subtree
                t = ' '.join(t.leaves())
                phrases.append(t)
    return phrases
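Since each match string is already a bracketed subtree, the phrase text is just its leaf words joined by spaces. As a rough sketch of that idea without NLTK, the leaves can be pulled out with a regular expression. This is a swapped-in technique, not what Tree.fromstring does internally; the helper name leaves_from_bracketed is hypothetical, and it assumes well-formed Penn-style output in which every word sits in a (TAG word) pair.

```python
import re

# Matches a preterminal such as "(NNP Albert)" and captures tag and word.
# Assumption: words never contain whitespace or literal parentheses.
_LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves_from_bracketed(tree_str):
    """Return the words of a bracketed parse string, left to right."""
    return [word for _tag, word in _LEAF.findall(tree_str)]


# Hand-written match string for illustration.
match_str = "(NP\n  (NP (DT the) (NN theory))\n  (PP (IN of)\n    (NP (NN relativity))))\n"
print(" ".join(leaves_from_bracketed(match_str)))
```

Unlike extract_phrase, this does not search for nested NP subtrees; it only flattens the one tree it is given.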
Here is my output:
{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity', 'the theory', 'the theory of relativity'}
Is there a way to make this code more concise, with fewer lines (especially the stanza_phrases and extract_phrase methods)?
Recommended answer
from stanza.server import CoreNLPClient

# get noun phrases with tregex
def noun_phrases(_client, _text, _annotators=None):
    pattern = 'NP'
    matches = _client.tregex(_text, pattern, annotators=_annotators)
    print("\n".join(["\t" + sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))

# English example
with CoreNLPClient(timeout=30000, memory='16G') as client:
    englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print('---')
    print(englishText)
    noun_phrases(client, englishText, _annotators="tokenize,ssplit,pos,lemma,parse")

# French example
with CoreNLPClient(properties='french', timeout=30000, memory='16G') as client:
    frenchText = "Je suis John."
    print('---')
    print(frenchText)
    noun_phrases(client, frenchText, _annotators="tokenize,ssplit,mwt,pos,lemma,parse")
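The answer's code prints the phrases, whereas the question collected them into a set. A small variation, sketched below, returns the phrases instead: it reads spanString from each match, which makes the NLTK round trip unnecessary. The function name phrases_from_matches and the sample dict are illustrative assumptions; the sample is hand-written to mimic the result shape rather than taken from a server.

```python
def phrases_from_matches(matches):
    """Collect the spanString of every match, de-duplicated, in first-seen order."""
    spans = (
        match["spanString"]
        for sentence in matches["sentences"]
        for match in sentence.values()
    )
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(spans))


# Hand-written sample for illustration; a real result has more fields.
sample = {
    "sentences": [
        {"0": {"spanString": "Albert Einstein"},
         "1": {"spanString": "a German-born theoretical physicist"}},
        {"0": {"spanString": "He"},
         "1": {"spanString": "the theory of relativity"},
         "2": {"spanString": "He"}},
    ]
}
print(phrases_from_matches(sample))
```

With a running client, the same function would be called on the return value of the tregex call. Wrap the result in set() if order does not matter.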