How to handle words that are not in word2vec's vocab optimally


Problem description

I have a list of ~10 million sentences, where each of them contains up to 70 words.

I'm running gensim word2vec on every word, and then taking the simple average of each sentence. The problem is that I use min_count=1000, so a lot of words are not in the vocab.

To solve that, I intersect the vocab array (which contains about 10,000 words) with every sentence; if there's at least one element left in that intersection, it returns the simple average of those words' vectors, otherwise it returns a vector of zeros.

The issue is that calculating every average takes a very long time when I run it on the whole dataset, even when splitting into multiple threads, and I would like to get a better solution that could run faster.

I'm running this on an EC2 r4.4xlarge instance.

I already tried switching to doc2vec, which was way faster, but the results were not as good as word2vec's simple average.

import multiprocessing

import numpy as np
from gensim.models import Word2Vec

word2vec_aug_32x = Word2Vec(sentences=sentences,
                            min_count=1000,
                            size=32,
                            window=2,
                            workers=16,
                            sg=0)

# Words that survived the min_count=1000 cutoff
vocab_arr = np.array(list(word2vec_aug_32x.wv.vocab.keys()))

def get_embedded_average(sentence):
    # Keep only in-vocabulary words, then average their vectors
    sentence = np.intersect1d(sentence, vocab_arr)
    if sentence.shape[0] > 0:
        return np.mean(word2vec_aug_32x.wv[sentence], axis=0).tolist()
    else:
        return np.zeros(32).tolist()

pool = multiprocessing.Pool(processes=16)
w2v_averages = np.asarray(pool.map(get_embedded_average, sentences))
pool.close()
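(Not part of the original question, but worth noting: much of the per-sentence cost here comes from `np.intersect1d`, which sorts and compares whole string arrays on every call. A plain dict lookup per word is much cheaper. A minimal sketch with toy data, where `vectors` and `word_index` stand in for the model's `wv.vectors` matrix and `wv.vocab` word-to-index mapping:)

```python
import numpy as np

# Toy stand-ins for the trained model's internals: an (N, dim) matrix of
# word vectors and a word -> row-index dict.
vectors = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
word_index = {'cat': 0, 'dog': 1, 'fish': 2}

def fast_embedded_average(sentence):
    # O(len(sentence)) dict lookups instead of intersecting string arrays
    rows = [word_index[w] for w in sentence if w in word_index]
    if rows:
        return vectors[rows].mean(axis=0)
    return np.zeros(vectors.shape[1])

avg = fast_embedded_average(['cat', 'dog', 'unknown'])
# mean of rows 0 and 1 -> [2.0, 3.0]
```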

If you have any suggestions of different algorithms or techniques that have the same purpose of sentence embedding and could solve my problem, I would love to read about it.

Recommended answer

You could use FastText instead of Word2Vec. FastText can embed out-of-vocabulary words by looking at subword information (character n-grams). Gensim also has a FastText implementation, which is very easy to use:

from gensim.models import FastText

model = FastText(sentences=training_data, size=128, ...)

word = 'hello'  # can be out of vocabulary
embedding = model.wv[word]  # fetches the word embedding
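(For intuition on why FastText can embed unseen words: it represents each word as a combination of its character n-gram vectors, with boundary markers `<` and `>` added around the word. A toy sketch of the n-gram decomposition, not gensim's actual implementation:)

```python
def char_ngrams(word, n_min=3, n_max=6):
    # FastText wraps the word in boundary markers, then takes all
    # character n-grams of length n_min..n_max
    w = '<' + word + '>'
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# An out-of-vocabulary word still shares n-grams such as '<he' or 'llo>'
# with in-vocabulary words, so a vector can be composed for it.
grams = char_ngrams('hello', 3, 4)
```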

