Can a token be removed from a spaCy document during pipeline processing?


Question

I am using spaCy (a great Python NLP library) to process a number of very large documents. However, my corpus contains many common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?

Recommended answer

spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
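To make the non-destructive property concrete, here is a minimal sketch. It assumes a blank English pipeline created with spacy.blank("en") (so no trained model is needed); the sample text is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to
# demonstrate the property; a trained pipeline behaves the same way here.
nlp = spacy.blank("en")

text = "Tokens keep  their whitespace, even double spaces."
doc = nlp(text)

# Each token stores its own text plus any trailing whitespace, so joining
# text_with_ws over all tokens gives back the original input.
rebuilt = "".join(token.text_with_ws for token in doc)

print(doc.text == text)
print(rebuilt == text)
```

Both comparisons should hold, which is why removing a token in place would break the guarantee that the Doc mirrors its input.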

While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution is to add a custom extension attribute like is_excluded to the tokens, based on whatever criterion you want to use:

from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)

When processing a Doc, you can now filter it to only get the tokens that are not excluded:

# nlp is an already loaded spaCy pipeline
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']

You can also make this more sophisticated by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
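As a rough sketch of the PhraseMatcher idea, the example below marks every token inside a matched phrase. It assumes spaCy v3 (where matcher.add takes a list of patterns) and a blank English pipeline; the extension name in_excluded_phrase, the helper mark_excluded_phrases, and the sample phrase are hypothetical names chosen for illustration. A writable extension with a default value is used here instead of a getter, because the matcher needs to set the flag per token.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# Hypothetical writable extension; force=True avoids an error on re-runs.
Token.set_extension("in_excluded_phrase", default=False, force=True)

# Match on the lowercase form so "As a matter of fact" is also found.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("EXCLUDED", [nlp.make_doc("as a matter of fact")])

def mark_excluded_phrases(doc):
    # Flag every token that falls inside a matched phrase.
    for _match_id, start, end in matcher(doc):
        for token in doc[start:end]:
            token._.in_excluded_phrase = True
    return doc

doc = mark_excluded_phrases(nlp("As a matter of fact, tokens stay put"))
kept = [token.text for token in doc if not token._.in_excluded_phrase]
print(kept)
```

Because the function takes a Doc and returns it, it has the shape of a pipeline component; under spaCy v3 it could be registered with the Language.component decorator and added via nlp.add_pipe.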

Also, for completeness: if you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether each token is followed by a space). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
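A minimal sketch of the words and spaces approach follows. It assumes a blank English pipeline; the helper name without_excluded and the EXCLUDED set are hypothetical. It covers only the text and whitespace, not Doc.from_array, so annotations such as tags or dependencies are not carried over to the new Doc.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# Hypothetical exclusion list, mirroring the getter example.
EXCLUDED = {"some", "excluded", "words"}

def without_excluded(doc):
    # Keep the surviving tokens and remember whether each one was
    # followed by whitespace in the original text.
    kept = [token for token in doc if token.text not in EXCLUDED]
    words = [token.text for token in kept]
    spaces = [bool(token.whitespace_) for token in kept]
    # Build a brand-new Doc that shares the original vocab.
    return Doc(doc.vocab, words=words, spaces=spaces)

doc = nlp("Test that tokens are excluded")
new_doc = without_excluded(doc)
print([token.text for token in new_doc])
print(len(doc), len(new_doc))
```

Note that new_doc no longer corresponds to the original input string, which is the trade-off the answer warns about; the original doc is left untouched.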
