Can a token be removed from a spaCy document during pipeline processing?

Question

I am using spaCy (a great Python NLP library) to process a number of very large documents. My corpus contains many common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?

Recommended answer

spaCy's tokenization is non-destructive: it always represents the original input text and never adds or deletes anything. This is a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.

While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution is to add a custom extension attribute such as is_excluded to the tokens, based on whatever criterion you want to use:

from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)

When processing a Doc, you can now filter it to only get the tokens that are not excluded:

doc = nlp("Test that tokens are excluded")  # nlp is a loaded spaCy pipeline
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']

You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
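As a rough sketch of the Matcher idea, the snippet below marks every token inside a matched sequence. It assumes spaCy v3 (where `Matcher.add` takes a list of patterns) and uses a blank English pipeline so no model download is needed. The attribute name `in_excluded_phrase`, the pattern, and the sample sentence are made up for illustration; the attribute is registered with a writable default rather than a getter so it can be set per token.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Hypothetical attribute name; a writable default (not a getter),
# so the value can be assigned token by token.
Token.set_extension("in_excluded_phrase", default=False, force=True)

nlp = spacy.blank("en")  # tokenizer-only pipeline (assumption: spaCy v3)
matcher = Matcher(nlp.vocab)
# Illustrative pattern: the two-token sequence "of course", case-insensitive.
matcher.add("EXCLUDED_PHRASE", [[{"LOWER": "of"}, {"LOWER": "course"}]])

doc = nlp("Of course the price of gas went up")
for match_id, start, end in matcher(doc):
    # Flag each token in the matched span instead of deleting it.
    for token in doc[start:end]:
        token._.in_excluded_phrase = True

kept = [token.text for token in doc if not token._.in_excluded_phrase]
print(kept)
# ['the', 'price', 'of', 'gas', 'went', 'up']
```

Note that the standalone "of" before "gas" is kept, because only the full sequence matches; that is the advantage over a plain word list.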

Also, for completeness: if you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether each token is followed by a space). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
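A minimal sketch of the rebuild approach, assuming a blank English pipeline and the same illustrative word list as above. It copies only the surviving tokens' text and trailing-whitespace flags into a new Doc; it does not carry over tags or dependencies, which would need the Doc.from_array step (and head offsets would have to be recomputed after removing tokens).

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer-only pipeline, used here for illustration
excluded = {"some", "excluded", "words"}  # illustrative word list

doc = nlp("Test that tokens are excluded")
kept = [token for token in doc if token.text not in excluded]

# words: token texts; spaces: whether each token is followed by a space
words = [token.text for token in kept]
spaces = [bool(token.whitespace_) for token in kept]

new_doc = Doc(doc.vocab, words=words, spaces=spaces)
print([token.text for token in new_doc])
# ['Test', 'that', 'tokens', 'are']
```

The new Doc no longer corresponds to the original input string, so character offsets from the old Doc should not be reused against it.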
