Dendrogram using pandas and scipy


Question

I want to generate a dendrogram based on correlation using pandas and scipy. My dataset is a DataFrame of returns of size n x m, where n is the number of dates and m is the number of companies. I run the following script:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np

m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_matrix = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_matrix, index=dates)

z = hc.linkage(dataframe.values.T, method='average', metric='correlation')
dendrogram = hc.dendrogram(z, labels=dataframe.columns)
plt.show()

This gives me a nice dendrogram. Now I would also like to use correlation measures other than ordinary Pearson correlation, which pandas supports through DataFrame.corr(method='<method>'). So at first I thought I could simply run the following code:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np

m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_returns = np.random.normal(0, 0.01, size=(len(dates), m))

dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = dataframe.corr() 

z = hc.linkage(corr.values, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()

However, if I do this I get strange values on the y-axis: the maximum value is greater than 1.4, whereas with the first script it is about 1. What am I doing wrong? Am I using the wrong metric in hc.linkage?

EDIT: I might add that the shape of the dendrogram is exactly the same. Do I have to normalize the third column of the resulting z by its maximum value?

Recommended answer

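Found the solution. If you have already calculated a distance matrix (be it from correlation or anything else), you have to condense it with scipy.spatial.distance.squareform before passing it to hc.linkage. Otherwise linkage treats each row of the square matrix as an observation vector and computes Euclidean distances between the rows, which is where the values above 1.4 come from. Note also that a correlation matrix is a similarity, so it is converted to a distance with 1 - corr first. That is,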

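from scipy.spatial.distance import squareform

dataframe = pd.DataFrame(data=random_returns, index=dates)
dist = 1 - dataframe.corr()  # correlation distance: 0 on the diagonal

# squareform converts the square matrix into the condensed vector that linkage expects
dist_condensed = squareform(dist.values)
z = hc.linkage(dist_condensed, method='average')
dendrogram = hc.dendrogram(z, labels=dist.columns)
plt.show()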
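
The following is a minimal sketch, not part of the original answer, that checks the explanation numerically on the same kind of random data. It compares the linkage from the first script with the one built from the condensed 1 - corr matrix, shows that passing the square correlation matrix directly is equivalent to using Euclidean distances between its rows, and then applies the same recipe to Spearman correlation. The fixed seed, the variable names, and the use of checks=False (to skip squareform's symmetry and zero-diagonal validation in case of floating-point noise) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
m = 5
dates = pd.date_range('2013-01-01', periods=365)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(len(dates), m)), index=dates)

# 1) First script: linkage computes the correlation distances (1 - r) itself.
z_direct = hc.linkage(returns.values.T, method='average', metric='correlation')

# 2) Second script: the square matrix is read as m observations in m dimensions,
#    so the heights are Euclidean distances between rows of the correlation matrix.
corr = returns.corr()
z_rows = hc.linkage(corr.values, method='average')
z_euclid = hc.linkage(pdist(corr.values, metric='euclidean'), method='average')

# 3) Answer: convert correlations to distances, then pass the condensed form.
dist = 1 - corr
z_pearson = hc.linkage(squareform(dist.values, checks=False), method='average')

print(np.allclose(z_rows, z_euclid))     # second script vs. explicit Euclidean
print(np.allclose(z_direct, z_pearson))  # first script vs. condensed 1 - corr

# The same recipe with another correlation method supported by pandas.
dist_spearman = 1 - returns.corr(method='spearman')
z_spearman = hc.linkage(squareform(dist_spearman.values, checks=False),
                        method='average')
# no_plot=True returns the dendrogram layout without drawing it.
leaves = hc.dendrogram(z_spearman, no_plot=True,
                       labels=list(dist_spearman.columns))['ivl']
print(leaves)
```

For nearly uncorrelated series the correlation matrix is close to the identity, so two of its rows are roughly sqrt(2) ≈ 1.41 apart in Euclidean distance, which matches the y-axis values seen in the question. Normalizing the third column of z is therefore not needed once the distances are passed in condensed form.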
