Python: clustering similar words based on word2vec
Problem description
This may be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2Vec model. The code is below:
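import string
import gensim
from newspaper import Article
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Helpers the snippet uses without defining them (assumed definitions)
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
snowball = SnowballStemmer('english')
site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()
def clean(doc):
    stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    snowed = " ".join(snowball.stem(word) for word in normalized.split())
    return snowed
b = clean(site.text)
# Word2Vec expects each sentence as a list of tokens; a raw string would be read character by character
model = gensim.models.Word2Vec([b.split()], min_count=1, size=32)  # gensim < 4.0; 4.x renames size to vector_size
print(model) ### Prints: Word2Vec(vocab=643, size=32, alpha=0.025) ####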
To cluster similar words, I am using PCA to visualize the clusters. The problem is that it forms only one big cluster, as seen in the image.
PCA & scatter plot code:
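import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
vocab = list(model.wv.vocab)  # gensim < 4.0; in 4.x use list(model.wv.key_to_index)
X = model.wv[vocab]  # look up vectors through model.wv rather than the deprecated model[...]
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
df = pd.concat([pd.DataFrame(X_pca),
                pd.Series(vocab)],
               axis=1)
df.columns = ['x', 'y', 'word']
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'])
plt.show()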
So, I have two questions here:
1) Is a single article enough to get clearly separated clusters?
2) If I have a model trained on a huge corpus and I want to predict the similar words in a new article and visualize them (i.e. the words in the article I'm predicting on) as clusters, is there a way to do that?
I highly appreciate your suggestions. Thank you.
- No, not really. For reference, common word2vec models trained on Wikipedia (in English) are built from a corpus of around 3 billion words.
- You can use KNN (or something similar). Gensim has the most_similar function to get the closest words. Using dimensionality reduction (like PCA or t-SNE) you can get yourself a nice cluster. (Not sure if gensim has a t-SNE module, but sklearn does, so you can use that.)
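To make the nearest-neighbour idea concrete, here is a minimal pure-Python sketch of ranking words by cosine similarity, which is the kind of lookup most_similar performs over a trained model's vectors. The toy vectors and the helper names `cosine` and `nearest_words` are invented for illustration; they are not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(query, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to `query`."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # a word is trivially closest to itself, so skip it
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy 3-dimensional "embeddings", made up purely for this example.
toy_vectors = {
    "blockchain": [0.9, 0.1, 0.0],
    "bitcoin":    [0.8, 0.2, 0.1],
    "ledger":     [0.7, 0.3, 0.0],
    "neural":     [0.1, 0.9, 0.2],
    "learning":   [0.0, 0.8, 0.3],
}

neighbours = nearest_words("blockchain", toy_vectors, topn=2)
print([word for word, _ in neighbours])
```

With a real model you would call most_similar on the trained vectors instead of writing this by hand; the sketch only shows what "closest words" means.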
By the way, you refer to an image, but it's not available.
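For the second question, one possible approach (a sketch, not the only way) is to look up the vectors of the new article's words in the large pre-trained model, cluster them in the full embedding space, and reduce to two dimensions only for plotting. The example below swaps in scikit-learn's KMeans for the grouping step, which the answer does not name, and uses PCA for the 2-D coordinates. The function name `cluster_words` and the synthetic vectors are assumptions for illustration. With gensim you would build the input as something like `{w: model.wv[w] for w in article_tokens if w in model.wv}`, where `article_tokens` is a hypothetical list of the new article's cleaned tokens.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_words(word_vectors, n_clusters=2, random_state=0):
    """Cluster words by their vectors and project them to 2-D for plotting.

    word_vectors: dict mapping each word to a 1-D vector of equal length.
    Returns a list of (word, cluster_label, x, y) tuples.
    """
    words = list(word_vectors)
    X = np.array([word_vectors[w] for w in words], dtype=float)
    # Cluster in the full embedding space, not in the 2-D projection.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    # PCA is used here only to obtain 2-D coordinates for a scatter plot.
    coords = PCA(n_components=2).fit_transform(X)
    return [(w, int(label), float(x), float(y))
            for w, label, (x, y) in zip(words, labels, coords)]

# Synthetic stand-in vectors: two well-separated groups of 8-dimensional points.
rng = np.random.default_rng(0)
toy = {f"tech{i}": rng.normal(loc=0.0, scale=0.1, size=8) for i in range(5)}
toy.update({f"food{i}": rng.normal(loc=3.0, scale=0.1, size=8) for i in range(5)})

result = cluster_words(toy, n_clusters=2)
for word, label, x, y in result:
    print(word, label, round(x, 2), round(y, 2))
```

The returned labels can then be passed as the colour argument of a scatter plot so that each cluster is drawn in its own colour. The number of clusters is a choice you have to make yourself; nothing in the model dictates it.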