Ignore out-of-vocabulary words when averaging vectors in spaCy


Problem Description

I would like to use a pre-trained word2vec model in spaCy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings.

To do this I use the following code:

import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector

Here nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, (2) vectorizes each word according to the dictionary provided, and (3) averages the word vectors within the sentence to produce a single output vector. That's fast and cool.
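
A quick way to check claim (3) against your own model (an illustrative snippet, assuming nlp and sentence from above):

import numpy as np

doc = nlp(sentence)
# Doc.vector should equal the plain mean over all token vectors, OOV included
token_mean = np.mean([t.vector for t in doc], axis=0)
print(np.allclose(doc.vector, token_mean))  # expected: True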

However, in this process out-of-vocabulary (OOV) terms are mapped to n-dimensional zero vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when computing the average. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.
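
To see the effect concretely, here is a small numpy illustration (with made-up 3-dimensional vectors, not spaCy output) of how a zero vector for an OOV token drags the mean toward the origin:

import numpy as np

in_vocab = np.array([[1.0, 2.0, 3.0],   # vector of an in-vocabulary word
                     [3.0, 2.0, 1.0]])  # vector of another in-vocabulary word
oov = np.zeros(3)                       # OOV words map to the zero vector

print(np.vstack([in_vocab, oov]).mean(axis=0))  # [1.333... 1.333... 1.333...]
print(in_vocab.mean(axis=0))                    # [2. 2. 2.]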

I have been able to do this with a post-processing step (see code below), but this becomes too slow for my purposes, so I was wondering whether there is a way to tell the nlp pipeline to ignore OOV terms beforehand, so that when calling nlp(sentence).vector it does not include OOV-term vectors when computing the mean:

import numpy as np
# Average only tokens that have a vector, i.e. skip OOV terms
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)

Approaches tried

In both cases documents is a list of 200 string elements with roughly 400 words each.

  1. Without handling OOV terms:

import spacy
import time

nlp = spacy.load('myspacy.bioword2vec.model')
# documents: list of 200 strings (~400 words each), defined beforehand
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in nlp.pipe(documents)]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s

  2. Ignoring OOV terms in the output vector. Note that in this case we need an extra 'if' statement to cover the case in which all words are OOV (if that happens, the output vector is r_vec):

r_vec = np.random.rand(200) # Random vector for empty text
# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector

times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s

In this example the mean difference in time to vectorize the 200 documents was 0.34 s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents consisting entirely of OOV terms, which might slightly increase computation time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.
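
For what it's worth, the two approaches can be combined: keep the batching of nlp.pipe but filter OOV tokens from the resulting Doc objects rather than re-parsing each text. This sketch is not from the original post; it just adapts get_vector to operate on an already-parsed Doc:

import numpy as np

r_vec = np.random.rand(200)  # fallback for documents in which every token is OOV

def doc_to_vector(doc):
    # doc is an already-parsed spaCy Doc, so no second pipeline run is needed
    vectors = np.asarray([t.vector for t in doc if t.has_vector])
    return vectors.mean(axis=0) if vectors.size else r_vec

# Single batched pass through the pipeline, OOV filtering done afterwards
documents_vec = [doc_to_vector(doc) for doc in nlp.pipe(documents)]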

I could always look for extra computational resources to run the second piece of code, but I was wondering whether there is any way of using nlp.pipe(documents) while ignoring OOV terms in the output. Any suggestion would be very welcome.

Recommended Answer

See this post by the author of spaCy, which says:

The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.

Try, for example:

import spacy
import numpy as np

nlp = spacy.load('en_core_web_md')

sentence = "I love Stack Overflow butitsalsodistractive"

print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
# Keep only tokens that have a vector and re-run the pipeline on the result
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])

np.array_equal(tokens.vector, tokensClean.vector)
# False
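
Re-parsing the joined string works, but it runs the whole pipeline a second time. The quoted suggestion of building a new Doc from a subset of tokens can also be done directly with the Doc(vocab, words=...) constructor, which avoids the second parse. A minimal sketch (the constructor is part of spaCy's public API, but verify that vectors behave this way on your own model):

from spacy.tokens import Doc

# Build a Doc from only the in-vocabulary tokens, without re-parsing;
# token vectors are looked up in the shared vocab, so .vector averages
# over in-vocabulary words only
clean_doc = Doc(nlp.vocab, words=[t.text for t in tokens if t.has_vector])
avg_vector = clean_doc.vector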

If you want to speed things up, disable the spaCy pipeline components that you don't use (such as NER, the dependency parser, etc.).
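
For example, unused components can be dropped at load time (component names such as 'parser', 'tagger' and 'ner' are typical defaults, but check which components your model actually ships with):

import spacy

# Vectors only need the tokenizer, so skip the heavier components
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner'])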

