Ignore out-of-vocabulary words when averaging vectors in spaCy


Problem Description

I would like to use a pre-trained word2vec model in spaCy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings.

To do this I use the following code:

import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector

Here nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, (2) vectorizes each word according to the dictionary provided, and (3) averages the word vectors within the sentence to produce a single output vector. That's fast and cool.
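
A quick way to check claim (3) against your own model (an illustrative snippet, assuming nlp and sentence from above):

import numpy as np

doc = nlp(sentence)
# Doc.vector should equal the plain mean over all token vectors, OOV included
token_mean = np.mean([t.vector for t in doc], axis=0)
print(np.allclose(doc.vector, token_mean))  # expected: True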

However, in this process out-of-vocabulary (OOV) terms are mapped to n-dimensional zero vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when computing the average. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.
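
To see the effect concretely, here is a small numpy illustration (with made-up 3-dimensional vectors, not spaCy output) of how a zero vector for an OOV token drags the mean toward the origin:

import numpy as np

in_vocab = np.array([[1.0, 2.0, 3.0],   # vector of an in-vocabulary word
                     [3.0, 2.0, 1.0]])  # vector of another in-vocabulary word
oov = np.zeros(3)                       # OOV words map to the zero vector

print(np.vstack([in_vocab, oov]).mean(axis=0))  # [1.333... 1.333... 1.333...]
print(in_vocab.mean(axis=0))                    # [2. 2. 2.]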

I have been able to do this with a post-processing step (see code below), but this becomes too slow for my purposes, so I was wondering whether there is a way to tell the nlp pipeline to ignore OOV terms beforehand, so that when calling nlp(sentence).vector it does not include OOV-term vectors when computing the mean:

import numpy as np
# Average only tokens that have a vector, i.e. skip OOV terms
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)

Approaches tried

In both cases documents is a list of 200 string elements with roughly 400 words each.

  1. Without handling OOV terms:

import spacy
import time

nlp = spacy.load('myspacy.bioword2vec.model')
# documents: list of 200 strings (~400 words each), defined beforehand
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in nlp.pipe(documents)]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s

  2. Ignoring OOV terms in the output vector. Note that in this case we need an extra 'if' statement to cover the case in which all words are OOV (if that happens, the output vector is r_vec):

r_vec = np.random.rand(200) # Random vector for empty text
# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector

times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s

In this example the mean difference in time to vectorize the 200 documents was 0.34 s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents consisting entirely of OOV terms, which might slightly increase computation time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.
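
For what it's worth, the two approaches can be combined: keep the batching of nlp.pipe but filter OOV tokens from the resulting Doc objects rather than re-parsing each text. This sketch is not from the original post; it just adapts get_vector to operate on an already-parsed Doc:

import numpy as np

r_vec = np.random.rand(200)  # fallback for documents in which every token is OOV

def doc_to_vector(doc):
    # doc is an already-parsed spaCy Doc, so no second pipeline run is needed
    vectors = np.asarray([t.vector for t in doc if t.has_vector])
    return vectors.mean(axis=0) if vectors.size else r_vec

# Single batched pass through the pipeline, OOV filtering done afterwards
documents_vec = [doc_to_vector(doc) for doc in nlp.pipe(documents)]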

I could always look for extra computational resources to run the second piece of code, but I was wondering whether there is any way of using nlp.pipe(documents) while ignoring OOV terms in the output. Any suggestion would be very welcome.

Recommended Answer

See this post by the author of spaCy, which says:

The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.

Try, for example:

import spacy
import numpy as np

nlp = spacy.load('en_core_web_md')

sentence = "I love Stack Overflow butitsalsodistractive"

print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
# Keep only tokens that have a vector and re-run the pipeline on the result
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])

np.array_equal(tokens.vector, tokensClean.vector)
# False
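
Re-parsing the joined string works, but it runs the whole pipeline a second time. The quoted suggestion of building a new Doc from a subset of tokens can also be done directly with the Doc(vocab, words=...) constructor, which avoids the second parse. A minimal sketch (the constructor is part of spaCy's public API, but verify that vectors behave this way on your own model):

from spacy.tokens import Doc

# Build a Doc from only the in-vocabulary tokens, without re-parsing;
# token vectors are looked up in the shared vocab, so .vector averages
# over in-vocabulary words only
clean_doc = Doc(nlp.vocab, words=[t.text for t in tokens if t.has_vector])
avg_vector = clean_doc.vector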

If you want to speed things up, disable the spaCy pipeline components that you don't use (such as NER, the dependency parser, etc.).
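
For example, unused components can be dropped at load time (component names such as 'parser', 'tagger' and 'ner' are typical defaults, but check which components your model actually ships with):

import spacy

# Vectors only need the tokenizer, so skip the heavier components
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner'])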

