Can a token be removed from a spaCy document during pipeline processing?
Question
I am using spaCy (a great Python NLP library) to process a number of very large documents; however, my corpus has a number of common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?
Answer
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever objective you want to use:
from spacy.tokens import Token
def get_is_excluded(token):
# Getter function to determine the value of token._.is_excluded
return token.text in ['some', 'excluded', 'words']
Token.set_extension('is_excluded', getter=get_is_excluded)
When processing a Doc, you can now filter it to only get the tokens that are not excluded:
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']
You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
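A minimal sketch of the PhraseMatcher approach, combined with the extension attribute from above. The blank pipeline, the phrase and the example sentence are illustrative choices of my own, not from the answer:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

# Register the extension with a default, so every token starts as not excluded
Token.set_extension("is_excluded", default=False, force=True)

nlp = spacy.blank("en")

# Match case-insensitively by comparing on the LOWER attribute
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("EXCLUDED", [nlp("as a matter of fact")])

doc = nlp("As a matter of fact, tokens can be marked.")

# Mark every token inside a matched span as excluded
for match_id, start, end in matcher(doc):
    for token in doc[start:end]:
        token._.is_excluded = True

print([t.text for t in doc if not t._.is_excluded])
# [',', 'tokens', 'can', 'be', 'marked', '.']
```

Because the Doc itself is untouched, doc.text still reproduces the full original sentence.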
Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
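A small sketch of that, using the TAG attribute as the example; the words, spaces and tag values here are my own illustrative choices:

```python
import numpy
import spacy
from spacy.attrs import TAG
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Construct a new Doc directly from words and spaces
words = ["Hello", "world", "!"]
spaces = [True, False, False]  # whether each token is followed by a space
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # 'Hello world!'

# Set attributes afterwards with Doc.from_array: one column per attribute,
# values are IDs from the string store
tag_ids = [nlp.vocab.strings.add(tag) for tag in ["UH", "NN", "."]]
doc.from_array([TAG], numpy.asarray(tag_ids, dtype="uint64").reshape(-1, 1))
print([t.tag_ for t in doc])  # ['UH', 'NN', '.']
```

Since the new Doc is built from your own word list, this is where you would drop the unwanted tokens, at the cost of no longer matching the original input text.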