Convert Python dictionary to Word2Vec object

Question

I have a dictionary mapping words to their vectors in Python, and I want to scatter-plot the n most similar words, because running t-SNE on a huge number of words takes forever. The best option seems to be converting the dictionary to a w2v object and working with that.

Recommended answer

I had the same issue and finally found a solution.

I assume that your dictionary looks like mine:

import numpy as np

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)

Basically, the keys are the users' IDs, and each of them has a vector with shape (300,).
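
As a quick sanity check on a dictionary of this shape, the sketch below stacks the values into one matrix and confirms that every vector has the same length. The helper name `dict_to_matrix` is hypothetical and only serves as an illustration; casting to float32 is an assumption that matches the precision the word2vec binary format stores.

```python
import numpy as np

def dict_to_matrix(d):
    """Return (keys, matrix) where matrix[i] is the vector for keys[i]."""
    keys = list(d.keys())
    # np.stack raises ValueError if the vectors do not all share one shape.
    matrix = np.stack([np.asarray(d[k], dtype=np.float32) for k in keys])
    return keys, matrix

# Sample data shaped like the dictionary described above.
d = {'1': np.random.randn(300), '2': np.random.randn(300)}
keys, matrix = dict_to_matrix(d)
print(keys, matrix.shape)
```

Keeping the key list next to the matrix makes the row order explicit, so row i can always be traced back to its key.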

Now, in order to use it as word2vec, I first need to save it to a binary file and then load it with the gensim library:

import gensim
import numpy as np
from numpy import float32 as REAL  # used by my_save_word2vec_format
from gensim import utils           # used by my_save_word2vec_format

# Word2VecKeyedVectors is the gensim 3.x class; gensim 4.x drops it in favor of KeyedVectors.
m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)

where the my_save_word2vec_format function is:

def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=None):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        Mapping from each word to its vector.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).

    """
    if not vocab or vectors is None or len(vectors) == 0:
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with open(fname, 'wb') as fout:
        # header line: number of vectors and their dimensionality
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # words are written in the dictionary's iteration order
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tobytes())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(float(val)) for val in row))))
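
The byte layout this function writes in binary mode can be illustrated without gensim or NumPy. The following is a minimal sketch under stated assumptions, not gensim's actual reader: the helper names `write_w2v_binary` and `read_w2v_binary` are hypothetical, little-endian float32 values are assumed, and keys are assumed to contain no spaces.

```python
import io
import struct

def write_w2v_binary(fout, vectors):
    """Write {word: sequence of floats} in the word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    # Header line: "<number of vectors> <dimensionality>\n"
    fout.write(f"{len(vectors)} {dim}\n".encode("utf8"))
    for word, vec in vectors.items():
        # Each record: the word, one space, then `dim` float32 values
        # (little-endian is an assumption of this sketch).
        fout.write(word.encode("utf8") + b" " + struct.pack(f"<{dim}f", *vec))

def read_w2v_binary(fin):
    """Read the layout back into {word: tuple of floats}."""
    count, dim = (int(x) for x in fin.readline().split())
    result = {}
    for _ in range(count):
        chars = []
        while True:
            ch = fin.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file")
            if ch == b" ":
                break
            if ch != b"\n":  # some writers end each record with a newline
                chars.append(ch)
        word = b"".join(chars).decode("utf8")
        result[word] = struct.unpack(f"<{dim}f", fin.read(4 * dim))
    return result

# Round trip through an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
write_w2v_binary(buf, {"1": [0.5, -1.0, 2.0], "2": [1.5, 0.25, -3.0]})
buf.seek(0)
print(read_w2v_binary(buf))
```

The sample values are chosen to be exactly representable in float32, so the round trip returns them unchanged; arbitrary floats would come back rounded to float32 precision.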

Then use

m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)

to load the model as word2vec.
