Is it possible to use spacy with already tokenized input?

Question

I have a sentence that has already been tokenized into words, and I want to get the part-of-speech tag for each word. When I check the documentation for spaCy, I see it starts from the raw sentence. I don't want to do that, because in that case spaCy might end up with a different tokenization. So I wonder: is it possible to use spaCy with a list of words rather than a string?

Here is an example of my question:

# I know that it does the following successfully:
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)

But I want to do something similar to the following:

import spacy
nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello',',','world','.']
doc = nlp(tokenized_text)
for token in doc:
    print(token.pos_)

I know this doesn't work, but is it possible to do something similar to that?

Answer

You can do this by replacing spaCy's default tokenizer with your own:

nlp.tokenizer = custom_tokenizer

where custom_tokenizer is a function that takes raw text as input and returns a Doc object.

You did not say how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:

from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []

    # your existing code to fill the list with tokens

    # instead of returning the list directly:
    # return tokens

    # build and return a Doc from it:
    return Doc(nlp.vocab, words=tokens)
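
Putting it together, here is a minimal, self-contained sketch of the pattern. It assumes a simple whitespace split as a stand-in for your existing tokenization function, and a pre-split input string to match:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(text):
    # hypothetical stand-in for your real tokenizer: split on single spaces
    tokens = text.split(' ')
    return Doc(nlp.vocab, words=tokens)

# replace the default tokenizer with the custom one
nlp.tokenizer = custom_tokenizer

doc = nlp('Hello , world .')
for token in doc:
    print(token.text, token.pos_)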

See the documentation for Doc.

If for some reason you cannot do this (maybe you don't have access to the tokenization function), you can use a dictionary:

from spacy.tokens import Doc

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    # look up the pre-computed tokenization for this exact raw string
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')

Either way, you can then use the pipeline as in your first example:

doc = nlp('Hello, world.')
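
One caveat: when a Doc is built from a list of words without spacing information, spaCy assumes every token is followed by a space, so doc.text will read something like 'Hello , world .' rather than the original 'Hello, world.'. If reconstructing the original string matters, Doc also accepts a spaces argument; a short sketch:

from spacy.tokens import Doc

words = ['Hello', ',', 'world', '.']
spaces = [False, True, False, False]  # whether each token is followed by a space
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # prints: Hello, world.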
