Visualise word2vec generated from gensim


Question


I have trained a doc2vec and corresponding word2vec on my own corpus using gensim. I want to visualise the word vectors using t-SNE, with each dot in the figure labelled by its word.

I looked at a similar question here: t-sne on word2vec

Following it, I have this code :

import gensim
import gensim.models as g

from sklearn.manifold import TSNE
import re
import matplotlib.pyplot as plt

modelPath="/Users/tarun/Desktop/PE/doc2vec/model3_100_newCorpus60_1min_6window_100trainEpoch.bin"
model = g.Doc2Vec.load(modelPath)

X = model[model.wv.vocab]
print(len(X))
print(X[0])
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X[:1000,:])

plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()

This gives a figure with dots but no words. That is, I don't know which dot represents which word. How can I display the word next to its dot?

Solution

Two parts to the answer: how to get the word labels, and how to plot the labels on a scatterplot.

Word labels in gensim's word2vec

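model.wv.vocab is a dict that maps each word to a Vocab object holding metadata such as the word's index and count (the numeric vector itself is looked up by indexing the model). In gensim 4.0 and later, wv.vocab was replaced by wv.key_to_index and wv.index_to_key, and vectors are indexed through model.wv rather than model. To load the data into X for t-SNE, I made one change.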

vocab = list(model.wv.vocab)
X = model[vocab]

This accomplishes two things: (1) it gets you a standalone vocab list for the final dataframe to plot, and (2) when you index model, you can be sure that you know the order of the words.
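Since the attribute holding the vocabulary differs between gensim versions, the same step can be sketched in a version-tolerant way. This is only a sketch: vocab_and_vectors and its limit parameter are hypothetical names, not part of gensim, and it assumes model.wv exposes either index_to_key (gensim 4.x) or vocab (gensim 3.x and earlier).

```python
def vocab_and_vectors(model, limit=None):
    """Return (words, vectors) in matching order for a gensim model.

    Assumption: `model.wv` exposes either `index_to_key` (gensim 4.x)
    or `vocab` (gensim 3.x and earlier). `limit` is a hypothetical
    convenience parameter that keeps only the first `limit` words.
    """
    wv = model.wv
    if hasattr(wv, "index_to_key"):   # gensim 4.x
        words = list(wv.index_to_key)
    else:                             # gensim 3.x and earlier
        words = list(wv.vocab)
    if limit is not None:
        words = words[:limit]
    # Indexing the keyed vectors with a list of words returns one row
    # per word, in the same order as the list.
    return words, wv[words]
```

With this helper, vocab, X = vocab_and_vectors(model, limit=1000) would replace the two lines above; the limit mirrors the X[:1000,:] slice in the question and keeps t-SNE fast on a large vocabulary.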

Proceed as before with

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

Now let's put X_tsne together with the vocab list. This is easy with pandas, so import pandas as pd if you don't have that yet.

df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])

The vocab words are the indices of the dataframe now.

I don't have your dataset, but for the sklearn newsgroups data used in the other Stack Overflow question you mentioned, an example df would look something like

                        x             y
politics    -1.524653e+20 -1.113538e+20
worry        2.065890e+19  1.403432e+20
mu          -1.333273e+21 -5.648459e+20
format      -4.780181e+19  2.397271e+19
recommended  8.694375e+20  1.358602e+21
arguing     -4.903531e+19  4.734511e+20
or          -3.658189e+19 -1.088200e+20
above        1.126082e+19 -4.933230e+19

Scatterplot

I like the object-oriented approach to matplotlib, so this starts out a little different.

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

Lastly, the annotate method will label coordinates. The first two arguments are the text label and the 2-tuple. Using iterrows(), this can be very succinct:

for word, pos in df.iterrows():
    ax.annotate(word, pos)

[Thanks to Ricardo in the comments for this suggestion.]
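With a large vocabulary, labelling every dot tends to produce an unreadable plot. The loop above could be wrapped in a small helper that caps the number of labels and nudges each one away from its dot. This is a sketch, not part of matplotlib: label_points and max_labels are hypothetical names, and ax is assumed to be a matplotlib Axes.

```python
def label_points(ax, words, coords, max_labels=200):
    """Annotate up to `max_labels` points on a matplotlib Axes.

    `words` and `coords` must be in the same order; each entry of
    `coords` is an (x, y) pair. Returns the number of labels drawn.
    """
    drawn = 0
    for word, (x, y) in zip(words, coords):
        if drawn >= max_labels:
            break
        # xytext with textcoords="offset points" shifts the label a few
        # points away from the dot so the two do not overlap.
        ax.annotate(word, (x, y), xytext=(3, 3), textcoords="offset points")
        drawn += 1
    return drawn
```

It would be called as label_points(ax, df.index, df[['x', 'y']].values) after the ax.scatter call.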

Then do plt.show() or fig.savefig(). Depending on your data, you'll probably have to adjust ax.set_xlim and ax.set_ylim to see into a dense cloud. This is the newsgroup example without any tweaking:

[Figure: scatterplot of the newsgroup example with each dot labelled by its word]

You can modify dot size, color, etc., too. Happy fine-tuning!
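For the axis limits mentioned above, one option is to derive them from the data so that a few extreme outliers do not squash the dense cloud. The following is a minimal sketch using only the standard library; central_limits, keep, and pad are hypothetical names chosen for illustration.

```python
def central_limits(values, keep=0.9, pad=0.05):
    """Return (low, high) bounds covering the central `keep` fraction of values.

    `pad` widens the bounds by that fraction of their span so points
    at the edge are not drawn on the axis border.
    """
    ordered = sorted(float(v) for v in values)
    if not ordered:
        raise ValueError("values must not be empty")
    tail = (1.0 - keep) / 2.0          # fraction trimmed from each end
    last = len(ordered) - 1
    low = ordered[int(round(tail * last))]
    high = ordered[int(round((1.0 - tail) * last))]
    margin = (high - low) * pad
    return low - margin, high + margin
```

The result can be passed straight to the axes, for example ax.set_xlim(*central_limits(df['x'])) and ax.set_ylim(*central_limits(df['y'])). Lowering keep zooms further into the centre of the cloud.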
