Can a token be removed from a spaCy document during pipeline processing?
Question
I am using spaCy (a great Python NLP library) to process a number of very large documents; however, my corpus has a number of common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?
Answer
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever objective you want to use:
from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)
When processing a Doc, you can now filter it to only get the tokens that are not excluded:
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']
You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
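A minimal sketch of that approach: the phrases, the blank English pipeline, and the use of a writable is_excluded extension (set with a default instead of a getter, so match results can be written back) are illustrative assumptions, not part of the original answer.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

# Writable extension (default-valued, so matcher results can set it);
# force=True overwrites any earlier registration of the same name.
Token.set_extension("is_excluded", default=False, force=True)

nlp = spacy.blank("en")

# Match the example phrases case-insensitively
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("EXCLUDE", [nlp("in order to"), nlp("as well as")])

doc = nlp("We filter in order to keep only useful tokens")
for match_id, start, end in matcher(doc):
    for token in doc[start:end]:
        token._.is_excluded = True

print([t.text for t in doc if not t._.is_excluded])
```

Because the exclusion is stored on the tokens rather than by deleting them, the Doc still reproduces the original text exactly; downstream components simply skip the flagged tokens.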
Also, for completeness: if you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether each token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
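A short sketch of that construction, assuming a blank English pipeline and a hard-coded word to drop (both are illustrative; the spaces are copied from the surviving tokens, so the last kept token may retain a trailing space):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
original = nlp("Test that tokens are excluded")

# Keep every token except the (illustrative) word "excluded"
kept = [t for t in original if t.text != "excluded"]
words = [t.text for t in kept]
spaces = [t.whitespace_ == " " for t in kept]

# Build a brand-new Doc from the surviving words and spaces
filtered = Doc(nlp.vocab, words=words, spaces=spaces)
print(filtered.text)
```

Note that the new Doc no longer round-trips to the original input text, which is exactly the trade-off the answer warns about.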