Plotting a 2D matrix in Python: code and most useful visualization

Question

I have a very large matrix (10 x 55678) stored as a NumPy matrix. The rows correspond to "topics" and the columns correspond to words (the unique words of a text corpus). Each entry (i, j) is a probability: word j belongs to topic i with probability x. Since I am using ids rather than the actual words, and since the matrix is so large, I need some way to visualize it. Which visualization do you suggest? A simple plot, or something more sophisticated and informative? (I am asking because I don't know which types of visualization are useful.) If possible, can you give an example that uses a NumPy matrix? Thanks.

The reason I ask is that I want a general view of the word-topic distributions in my corpus. Any other methods are welcome.

Recommended answer

You could certainly use matplotlib's imshow or pcolor method to display the data, but as the comments have mentioned, the result might be hard to interpret without zooming in on subsets of the data.

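import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in: 5000 words (rows) x 10 topics (columns), the transpose of the question's layout
a = np.random.normal(0.0, 0.5, size=(5000, 10))**2
a = a / np.sum(a, axis=1)[:, None]  # normalize so each row sums to 1

plt.pcolor(a)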
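For comparison, here is a minimal sketch of the imshow route mentioned above. The helper name `plot_matrix`, the axis labels, and the colormap are illustrative choices rather than anything from the original answer; `aspect="auto"` is used on the assumption that the matrix is far taller than it is wide, so square cells would otherwise produce an unreadably thin image.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_matrix(a, ax=None):
    """Draw a 2D array as a heat map (hypothetical helper, not from the answer).

    Rows go on the y-axis and columns on the x-axis.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 8))
    # aspect="auto" stretches the cells to fill the axes instead of keeping
    # them square, which matters for very tall or very wide matrices.
    im = ax.imshow(a, aspect="auto", cmap="viridis")
    ax.set_xlabel("topic index")
    ax.set_ylabel("word index")
    ax.figure.colorbar(im, ax=ax, label="probability")
    return im


# Synthetic stand-in data with the same layout as above: 5000 words x 10 topics.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.5, size=(5000, 10)) ** 2
a = a / a.sum(axis=1, keepdims=True)  # each row sums to 1

im = plot_matrix(a)
# plt.show()  # uncomment when running interactively
```

For a matrix laid out as topics x words, as in the question, passing `a.T` (and swapping the axis labels) gives the same picture.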

You could then sort the words so that those sharing the same most probable cluster are grouped together:

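# Assumes numpy as np, matplotlib.pyplot as plt, and the array a from above.
maxvi = np.argsort(a, axis=1)   # per word: cluster indices, least to most probable
ii = np.argsort(maxvi[:, -1])   # order words by their most probable cluster

plt.pcolor(a[ii, :])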

Because the rows have been sorted, the word index on the y-axis no longer matches the original ordering.
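A possible refinement, sketched here as an assumption rather than as part of the original answer, is to also order the words inside each cluster by how strongly they belong to it, and to keep the permutation so that sorted rows can be mapped back to the original word ids. The function name `sort_by_dominant_topic` is hypothetical.

```python
import numpy as np


def sort_by_dominant_topic(a):
    """Group rows by their most probable column, strongest rows first.

    `a` is a (words x topics) array. Returns the row permutation and the
    dominant topic of each row in the new order.
    """
    dominant = np.argmax(a, axis=1)                 # most probable topic per word
    strength = a[np.arange(a.shape[0]), dominant]   # probability of that topic
    # np.lexsort treats the LAST key as the primary one: sort by topic first,
    # then by descending strength within each topic.
    order = np.lexsort((-strength, dominant))
    return order, dominant[order]


# Tiny hand-made example: 4 words x 2 topics.
demo = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.3, 0.7],
                 [0.6, 0.4]])
order, topics = sort_by_dominant_topic(demo)
```

Here `order[k]` is the original row index of the word shown in row `k` of `demo[order]`, so it can serve as a lookup for tick labels after sorting.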

Another possibility is to use the networkx package to plot word clusters for each category. The words with the highest probability are represented by nodes that are either larger or closer to the center of the graph, and words that have no membership in the category are left out. This might be easier to read, since you have a large number of words and a small number of categories.
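One way this idea could be sketched is as a star graph per category: a hub node for the category, linked to its most probable words. The function names `topic_graph` and `draw_topic_graph`, the `top_n` cutoff, the `weight` attribute, and the node-size scaling are all assumptions made for illustration; the original answer does not give code for this approach.

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt


def topic_graph(probs, topic, top_n=20, labels=None):
    """Build a star graph for one topic (hypothetical helper).

    `probs` is a (words x topics) array. The hub node stands for the topic and
    is linked to at most `top_n` words with the highest probability for it.
    """
    column = probs[:, topic]
    top = np.argsort(column)[::-1][:top_n]   # most probable words first
    top = top[column[top] > 0]               # skip words with no membership
    g = nx.Graph()
    hub = f"topic {topic}"
    g.add_node(hub, weight=1.0)
    for w in top:
        name = labels[w] if labels is not None else f"word {w}"
        p = float(column[w])
        g.add_node(name, weight=p)
        # In spring_layout, a heavier edge attracts its endpoints more
        # strongly, so probable words tend to sit nearer the hub.
        g.add_edge(hub, name, weight=p)
    return g


def draw_topic_graph(g, ax=None):
    """Draw the graph with node area proportional to the stored probability."""
    pos = nx.spring_layout(g, weight="weight", seed=0)
    sizes = [3000 * g.nodes[n]["weight"] for n in g]  # 3000 is an arbitrary scale
    nx.draw_networkx(g, pos=pos, node_size=sizes, ax=ax, font_size=8)


# Tiny hand-made example: 4 words x 2 topics.
probs = np.array([[0.7, 0.3],
                  [0.0, 1.0],
                  [0.5, 0.5],
                  [0.9, 0.1]])
g = topic_graph(probs, topic=0, top_n=3)
```

Calling `draw_topic_graph(g)` followed by `plt.show()` renders the cluster. With only ids available, the default labels are `word <id>`; passing a list of real words as `labels` would make the picture more informative.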

Hopefully one of these suggestions is useful.
