How to specify a word vector for OOV terms in spaCy?


Question

I have a pre-trained word2vec model that I load into spaCy to vectorize new words. Given new text, I call nlp('hi').vector to obtain the vector for the word 'hi'.

Eventually, a word that is not in my pre-trained model's vocabulary needs to be vectorized. In that case spaCy defaults to a vector filled with zeros. I would like to be able to set this default vector for OOV terms.

Example:

import spacy

path_model = '/home/bionlp/spacy.bio_word2vec.model'
nlp = spacy.load(path_model)  # the original passed an undefined name, path_spacy
print(nlp('abcdef').vector, '\n', nlp('gene').vector)

This code outputs a dense vector for the word 'gene' and an all-zeros vector for the word 'abcdef' (since it is not in the vocabulary).

My goal is to be able to specify the vector for missing words, so that instead of a vector full of 0s for the word 'abcdef' I get (for instance) a vector full of 1s.

Recommended answer

If you simply want your own plug vector instead of spaCy's default all-zeros vector, you could add an extra step that replaces any all-zeros vector with yours. For example:

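words = ['words', 'may', 'by', 'fehlt']
my_oov_vec = ...  # whatever you like
# nlp(word) returns a Doc; take .vector to get the underlying numpy array
spacy_vecs = [nlp(word).vector for word in words]
# an all-zeros vector means spaCy had no vector for the word
fixed_vecs = [vec if vec.any() else my_oov_vec
              for vec in spacy_vecs]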
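The same substitution can be wrapped in a small helper. The following is a minimal sketch rather than anything from spaCy's API: `vector_or_default`, `lookup`, and `toy_vectors` are hypothetical names, the toy table stands in for `nlp(word).vector`, and the all-ones default is taken from the example in the question.

```python
import numpy as np

DIM = 4  # toy dimensionality; a real model has its own width

# Hypothetical stand-in for nlp(word).vector: known words map to dense
# vectors, unknown words come back as all zeros.
toy_vectors = {"gene": np.array([0.2, -0.1, 0.7, 0.3], dtype=np.float32)}


def lookup(word):
    """Return the toy vector for a word, or all zeros if it is unknown."""
    return toy_vectors.get(word, np.zeros(DIM, dtype=np.float32))


def vector_or_default(vec, default):
    """Return vec unless every component is zero, in which case return default."""
    return vec if vec.any() else default


# Assumption: an all-ones plug vector, as suggested in the question.
my_oov_vec = np.ones(DIM, dtype=np.float32)
fixed = {w: vector_or_default(lookup(w), my_oov_vec) for w in ["gene", "abcdef"]}
```

One caveat with this check: a word that is in the vocabulary but happens to have an exactly all-zeros vector would also be replaced. When working with spaCy tokens directly, the `has_vector` attribute may be a more direct test.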

I'm not sure why you'd want to do this. A lot of work with word vectors simply drops out-of-vocabulary words; using any plug value, including spaCy's zero vector, may just add unhelpful noise.
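To illustrate the "drop OOV words" alternative, here is a minimal sketch that averages only the vectors of known words. The function name `mean_vector_skipping_oov` and the two-dimensional toy table are assumptions made for the example, not part of spaCy or word2vec.

```python
import numpy as np

# Toy vocabulary; in practice these would come from the pre-trained model.
toy_vectors = {
    "gene": np.array([1.0, 0.0], dtype=np.float32),
    "protein": np.array([0.0, 1.0], dtype=np.float32),
}


def mean_vector_skipping_oov(words, vectors, dim):
    """Average the vectors of known words only; OOV words are left out."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # Nothing to average: fall back to zeros (one possible choice).
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


doc_vec = mean_vector_skipping_oov(["gene", "abcdef", "protein"], toy_vectors, dim=2)
```

In this toy case the result is the mean of the two known vectors. Had 'abcdef' been counted as a zero vector instead, each component would have been pulled down from one half to one third, which is the kind of dilution the answer warns about.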

And if better handling of OOV words is important, note that some other word-vector models, such as FastText, can synthesize better-than-nothing guess vectors for OOV words by using vectors learned for subword fragments during training. That is similar to how people can often work out the gist of a word from familiar word roots.
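The subword idea can be sketched roughly as follows. This is a simplified illustration, not FastText's actual implementation (which, among other things, hashes n-grams into buckets): `char_ngrams`, `guess_vector`, and the toy n-gram table are hypothetical, and the 3-to-6 character range with `<` and `>` boundary markers is assumed as the default.

```python
import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def guess_vector(word, ngram_vectors, dim):
    """Average the vectors of those n-grams of the word that have a known vector."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Toy n-gram table; a real model learns these vectors during training.
ngram_vectors = {
    "<ge": np.array([1.0, 0.0], dtype=np.float32),
    "ene": np.array([0.0, 1.0], dtype=np.float32),
}

# 'genes' is not a known word here, but it shares fragments with known ones.
oov_guess = guess_vector("genes", ngram_vectors, dim=2)
```

Here 'genes' gets a non-zero guess because it contains the fragments `<ge` and `ene`; a word sharing no fragment with the table still falls back to zeros.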
