Dendrogram using pandas and scipy
Problem description
I wish to generate a dendrogram based on correlation using pandas and scipy. I use a dataset (as a DataFrame) consisting of returns, which is of size n x m, where n is the number of dates and m is the number of companies. Then I simply run the script:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_matrix = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_matrix, index=dates)
z = hc.linkage(dataframe.values.T, method='average', metric='correlation')
dendrogram = hc.dendrogram(z, labels=dataframe.columns)
plt.show()
and I get a nice dendrogram. Now, I'd also like to use correlation measures other than ordinary Pearson correlation, which pandas supports by simply invoking DataFrame.corr(method='<method>'). So at first I thought I could simply run the following code:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
import numpy as np
m = 5
dates = pd.date_range('2013-01-01', periods=365)
random_returns = np.random.normal(0, 0.01, size=(len(dates), m))
dataframe = pd.DataFrame(data=random_returns, index=dates)
corr = dataframe.corr()
z = hc.linkage(corr.values, method='average')
dendrogram = hc.dendrogram(z, labels=corr.columns)
plt.show()
However, if I do this I get strange values on the y-axis: the maximum value is greater than 1.4, whereas with the first script it is about 1. What am I doing wrong? Am I using the wrong metric in hc.linkage?
EDIT: I might add that the shape of the dendrogram is exactly the same. Do I have to normalize the third column of the resulting z by its maximum value?
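For reference, here is a minimal sketch of what the second script appears to compute. When hc.linkage receives a 2-D array, it treats each row as an observation vector and, with the default metric, measures Euclidean distance between rows. For nearly uncorrelated columns the correlation matrix is close to the identity, so its rows are close to unit vectors and their pairwise distances are close to sqrt(2), about 1.41. The seed and variable names below are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import pdist

# Assumed setup: random returns like those in the question, with a fixed seed
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(365, 5)))
corr = returns.corr()

# What the second script does: the rows of corr are treated as 5 observations
z_square = hc.linkage(corr.values, method='average')

# The same computation with the implicit Euclidean step written out
z_euclid = hc.linkage(pdist(corr.values, metric='euclidean'), method='average')

same_as_euclidean = np.allclose(z_square, z_euclid)
max_height = z_square[:, 2].max()
print(same_as_euclidean, max_height)
```

If this reading is right, no rescaling of the third column of z is needed; the heights are simply measured in a different distance.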
Recommended answer
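Found the solution. hc.linkage expects either raw observations or a condensed distance matrix; a square 2-D array is treated as observations, and Euclidean distances are computed between its rows. If you have already calculated a distance matrix (be it correlation-based or otherwise), you simply have to condense it using distance.squareform before passing it to hc.linkage. That is,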
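from scipy.spatial.distance import squareform

dataframe = pd.DataFrame(data=random_returns, index=dates)
# Correlation distance 1 - r: perfectly correlated columns are at distance 0
dist = 1 - dataframe.corr()
# Convert the square m x m matrix to the condensed form that linkage expects;
# checks=False skips squareform's validation of symmetry and a zero diagonal
dist_condensed = squareform(dist.values, checks=False)
z = hc.linkage(dist_condensed, method='average')
dendrogram = hc.dendrogram(z, labels=dist.columns.tolist())
plt.show()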
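The same approach extends to the other correlation measures mentioned in the question. The sketch below wraps it in a hypothetical helper, correlation_linkage (a name invented here, not part of pandas or scipy), and checks that with Pearson correlation it reproduces the linkage from the first script, since scipy's 'correlation' metric is also 1 minus the Pearson coefficient. The seed and data are assumptions for illustration, and no_plot=True is used so the example runs without opening a figure.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform


def correlation_linkage(frame, corr_method='pearson', linkage_method='average'):
    """Hypothetical helper: cluster the columns of `frame` on 1 - correlation."""
    dist = 1 - frame.corr(method=corr_method)
    # checks=False skips validation of symmetry and an exactly zero diagonal
    condensed = squareform(dist.values, checks=False)
    return hc.linkage(condensed, method=linkage_method)


# Assumed setup: random returns like those in the question, with a fixed seed
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(365, 5)))

# Pearson via the helper should match the first script's metric='correlation'
z_pearson = correlation_linkage(returns, 'pearson')
z_reference = hc.linkage(returns.values.T, method='average', metric='correlation')
matches_first_script = np.allclose(z_pearson, z_reference)

# Rank-based correlation only requires a different corr_method
z_spearman = correlation_linkage(returns, 'spearman')
tree = hc.dendrogram(z_spearman, labels=returns.columns.tolist(), no_plot=True)
print(matches_first_script, tree['ivl'])
```

Dropping no_plot=True and calling plt.show() afterwards should draw the dendrogram as in the scripts above.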