Is it possible to use spacy with already tokenized input?
Question
I have a sentence that has already been tokenized into words. I want to get the part-of-speech tag for each word in the sentence. When I check the documentation in spaCy, I realize it starts from the raw sentence. I don't want to do that because, in that case, spaCy might end up with a different tokenization. Therefore, I wonder if it is possible to use spaCy with a list of words (rather than a string)?
Here is an example of my question:
# I know that it does the following successfully:
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)
But I want to do something similar to the following:
import spacy
nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello', ',', 'world', '.']
doc = nlp(tokenized_text)
for token in doc:
    print(token.pos_)
I know it doesn't work, but is it possible to do something similar to that?
Answer
You can do this by replacing spaCy's default tokenizer with your own:
nlp.tokenizer = custom_tokenizer
where custom_tokenizer is a function taking raw text as input and returning a Doc object.
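To illustrate the required shape (this is just a sketch, not the asker's actual tokenization): any callable that takes the raw text and returns a Doc can be assigned to nlp.tokenizer, for example one that naively splits on single spaces:

from spacy.tokens import Doc

def whitespace_tokenizer(text):
    # Naive example: split the raw text on single spaces and wrap the
    # resulting pieces in a Doc, bypassing spaCy's own tokenization.
    words = text.split(' ')
    return Doc(nlp.vocab, words=words)

nlp.tokenizer = whitespace_tokenizer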
You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:
from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []
    # your existing code to fill the list with tokens
    # replace this line:
    return tokens
    # with this:
    return Doc(nlp.vocab, words=tokens)
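One detail worth knowing (this is my reading of the Doc constructor, so double-check the API docs for your spaCy version): Doc also accepts a spaces argument marking whether each token is followed by whitespace. If you omit it, every token is assumed to be followed by a space, so doc.text will not exactly reproduce the original string. If that matters, pass spaces explicitly, for example:

from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = ['Hello', ',', 'world', '.']
    # spaces[i] is True if tokens[i] is followed by a space in the
    # original text 'Hello, world.'
    spaces = [False, True, False, False]
    return Doc(nlp.vocab, words=tokens, spaces=spaces)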
See the documentation on Doc.
If for some reason you cannot do this (maybe you don't have access to the tokenization function), you can use a dictionary:
tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')
Either way, you can then use the pipeline as in your first example:
doc = nlp('Hello, world.')
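Putting the pieces together, a complete sketch using the dictionary-based tokenizer might look like this (the exact tags you get depend on the model version):

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    # Look up the pre-computed tokenization for this exact input text.
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')

nlp.tokenizer = custom_tokenizer

doc = nlp('Hello, world.')
for token in doc:
    print(token.text, token.pos_)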