Removing named entities from a document using spacy


Problem description


I have tried to remove words from a document that are considered to be named entities by spacy, so basically removing "Sweden" and "Nokia" from the string example. I could not find a way to work around the problem that entities are stored as a span. So when comparing them with single tokens from a spacy doc, it prompts an error.

In a later step, this process is supposed to be a function applied to several text documents stored in a pandas data frame.

I would appreciate any kind of help and advice on how to maybe better post questions as this is my first one here.


import spacy

nlp = spacy.load('en')

text_data = u'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

for word in document:
    # The TypeError is raised here: document.ents holds Span objects, while word is a Token
    if word not in document.ents:
        text_no_namedentities.append(word)

return " ".join(text_no_namedentities)

It creates the following error:

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)

Solution

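This will get you the result you're asking for: compare the text of each token with the text of each entity instead of comparing Token objects with Span objects. Reviewing the spaCy documentation on Named Entity Recognition should help you going forward.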

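import spacy

nlp = spacy.load('en_core_web_sm')

text_data = 'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

# Collect the text of every entity span, then keep only tokens whose text is not among them.
# Note: this removes single-token entities such as "Sweden"; the text of a
# multi-word entity never equals the text of a single token.
ents = [e.text for e in document.ents]
text_no_namedentities = [item.text for item in document if item.text not in ents]

print(" ".join(text_no_namedentities))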

Output:

This is a text document that speaks about entities like and
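
Since the question mentions applying this to several documents in a pandas data frame, here is a minimal sketch of one way that might look. It is a variation on the answer above, not the answerer's code: it checks each token's `ent_type_` attribute (an empty string for tokens outside any entity) rather than comparing strings, so multi-word entities should be dropped as well. The names `strip_entities` and `demo`, and the column names `text` and `text_no_entities`, are made up for illustration, and the exact entities removed depend on the model.

```python
def strip_entities(doc):
    """Join the text of every token in `doc` that is not part of a named entity."""
    # token.ent_type_ is '' for tokens outside any entity, so `not` keeps only those.
    return " ".join(token.text for token in doc if not token.ent_type_)


def demo():
    """Apply strip_entities to a data frame column (needs spaCy, pandas and en_core_web_sm)."""
    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Hypothetical data frame with one text document per row.
    df = pd.DataFrame({
        "text": [
            "This is a text document that speaks about entities like Sweden and Nokia",
        ]
    })
    # nlp.pipe processes the texts as a stream and yields one Doc per input string.
    df["text_no_entities"] = [strip_entities(doc) for doc in nlp.pipe(df["text"])]
    print(df["text_no_entities"].tolist())
```

Calling `demo()` runs the whole pipeline; `strip_entities` on its own works with any Doc you have already created, for example `strip_entities(nlp(text_data))`.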
