Python: clustering similar words based on word2vec


Question

This might be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2vec model. The code is below:

site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()

def clean(doc):
    stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    snowed = " ".join(snowball.stem(word) for word in normalized.split())
    return snowed   

b = clean(site.text)
model = gensim.models.Word2Vec([b],min_count=1,size=32)
print(model) ### Prints: Word2Vec(vocab=643, size=32, alpha=0.025) ####

To cluster similar words, I am using PCA to visualize them. But the problem is that it forms only one big cluster, as seen in the image.

PCA & scatter plot Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vocab = list(model.wv.vocab)      # all words in the trained vocabulary
X = model.wv[vocab]               # their 32-dimensional vectors
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)      # project to 2D for plotting

df = pd.concat([pd.DataFrame(X_pca),
                pd.Series(vocab)],
               axis=1)
df.columns = ['x', 'y', 'word']

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'])
plt.show()

So, I have two questions here:

1) Is just one article enough to get a clear segregation of the clusters?

2) If I have a model trained on a huge corpus and I want to predict similar words in a new article and visualize them (i.e. the words in the article I'm predicting) in the form of clusters, is there a way to do that?

I highly appreciate your suggestions. Thank you.

Solution

  1. No, not really. For reference, common word2vec models trained on Wikipedia (in English) are built from around 3 billion words.
  2. You can use KNN (or something similar). Gensim has the most_similar function to get the closest words. Using dimensionality reduction (like PCA or t-SNE) you can get yourself a nice cluster. (Not sure whether gensim has a t-SNE module, but sklearn does, so you can use that.) A short sketch of this is shown below.
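
A minimal sketch of that second suggestion, reusing the model and cleaned text b from the question above (with a model trained on a big corpus the separation gets much clearer); the query word 'blockchain' and the t-SNE settings are illustrative assumptions, not something from the original answer:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# tokens of the new article that the model actually knows
new_tokens = [w for w in set(b.split()) if w in model.wv.vocab]

# nearest neighbours of one word via gensim's most_similar
print(model.wv.most_similar('blockchain', topn=10))  # assumes 'blockchain' survived cleaning and is in the vocabulary

# project only the new article's vectors to 2D with t-SNE (from sklearn)
vectors = model.wv[new_tokens]
tsne = TSNE(n_components=2, perplexity=min(30, len(new_tokens) - 1), random_state=0)
coords = tsne.fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, new_tokens):
    ax.annotate(word, (x, y), fontsize=8)  # label each point with its word
plt.show()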

btw you're referring to some image, but it's not available.
