NLTK Named Entity recognition to a Python list

Question

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."


import nltk
nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)), binary=True)

But I can't figure out how to save these entities to a list. For example:

print(Entity_list)
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')

Thanks.

Solution

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree to get to the named entities (NEs).

Take a look at the related question "Named Entity Recognition with Regular Expression: NLTK".

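>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>> 
>>> def get_continuous_chunks(text):
...     # ne_chunk expects POS-tagged tokens, so tokenize and tag first.
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []  # unique named entities, in order of appearance
...     current_chunk = []     # text of adjacent Tree nodes seen so far
...     for i in chunked:
...         if isinstance(i, Tree):
...             # A Tree child is a named-entity chunk; collect its tokens.
...             current_chunk.append(" ".join(token for token, pos in i.leaves()))
...         elif current_chunk:
...             # A plain (token, tag) tuple ends the current run of chunks.
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             # Always reset, even when the entity was a duplicate.
...             current_chunk = []
...     # Flush an entity that ends the sentence.
...     if current_chunk:
...         named_entity = " ".join(current_chunk)
...         if named_entity not in continuous_chunk:
...             continuous_chunk.append(named_entity)
...     return continuous_chunk
... 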
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']


>>> my_sent = "How's the weather in New York and Brooklyn"
>>> get_continuous_chunks(my_sent)
['New York', 'Brooklyn']
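
If you also want to know what kind of entity each chunk is, the same traversal can keep the subtree label. The sketch below is only an illustration: `entities_with_labels` is a hypothetical helper name, and the sample tree is built by hand to mimic the shape of `ne_chunk` output (the `GPE` labels are assigned manually here, not produced by the chunker). The fallback `Tree` class is a stand-in, not NLTK code, included only so the snippet runs where NLTK is not installed.

```python
try:
    from nltk.tree import Tree  # the real class returned by ne_chunk
except ImportError:
    # Stand-in (an assumption, not NLTK code) that mimics only the
    # two methods used below: label() and leaves().
    class Tree(list):
        def __init__(self, label, children):
            super().__init__(children)
            self._label = label

        def label(self):
            return self._label

        def leaves(self):
            out = []
            for child in self:
                if isinstance(child, Tree):
                    out.extend(child.leaves())
                else:
                    out.append(child)
            return out


def entities_with_labels(chunked):
    """Return (entity_text, label) pairs from a chunk tree shaped like ne_chunk output."""
    entities = []
    for node in chunked:
        # Named-entity chunks are Tree children; ordinary words are (token, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(token for token, pos in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built tree that mimics the structure ne_chunk returns.
sample = Tree("S", [
    ("How", "WRB"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("and", "CC"),
    Tree("GPE", [("Brooklyn", "NNP")]),
])

print(entities_with_labels(sample))
```

With `binary=True`, as in the question, every chunk carries the single label `NE`, so the label only adds information when `ne_chunk` is called with its default `binary=False`.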
