绘制文档tfidf 2D图 [英] plot a document tfidf 2D graph

查看：99 发布时间：2020/4/26 10:18:36 python numpy scipy scikit-learn k-means

本文介绍了绘制文档tfidf 2D图的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想为我的句子列表绘制一个二维图，其中x轴为术语，y轴为TFIDF分数(或文档ID).我使用scikit learning的fit_transform()来获得scipy矩阵，但是我不知道如何使用该矩阵来绘制图形.我正在尝试绘制一个图，以了解使用kmeans对我的句子进行分类的能力.

I would like to plot a 2d graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit learn's fit_transform() to get the scipy matrix but i do not know how to use that matrix to plot the graph. I am trying to get a plot to see how well my sentences can be classified using kmeans.

这是fit_transform(sentence_list)的输出:

(文档ID，条款编号)tfidf得分

(0, 1023)   0.209291711271
(0, 924)    0.174405532933
(0, 914)    0.174405532933
(0, 821)    0.15579574484
(0, 770)    0.174405532933
(0, 763)    0.159719994016
(0, 689)    0.135518787598

这是我的代码:

sentence_list=["Hi how are you", "Good morning" ...]
vectorizer=TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized=vectorizer.fit_transform(sentence_list)
num_samples, num_features=vectorized.shape
print "num_samples:  %d, num_features: %d" %(num_samples,num_features)
num_clusters=10
km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)
PRINT km.labels_   # Returns a list of clusters ranging 0 to 10

谢谢

推荐答案

使用词袋"时，每个句子都在一个高维空间中表示，该空间的长度等于词汇量.如果要以2D形式表示，则需要减小尺寸，例如，使用具有两个组件的PCA:

When you use Bag of Words, each of your sentences gets represented in a high dimensional space of length equal to the vocabulary. If you want to represent this in 2D you need to reduce the dimension, for example using PCA with two components:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train', 
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])        
X = pipeline.fit_transform(newsgroups_train.data).todense()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1], c=data.target)
plt.show()              #not required if using ipython notebook

现在，您可以例如计算并绘制群集在此数据上输入的内容:

Now you can for example calculate and plot the cluster enters on this data:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)

plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c='r')
plt.show()              #not required if using ipython notebook

这篇关于绘制文档tfidf 2D图的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

绘制文档tfidf 2D图 [英] plot a document tfidf 2D graph

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

绘制文档tfidf 2D图 [英] plot a document tfidf 2D graph

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭