Python: Rank order correlation for categorical data
Question
I am somewhat new to programming and statistics, so please help me improve this question if it is formally not correct.
I have a lot of parameters and a couple of result vectors that I produced in a Monte Carlo simulation. Now I want to test the influence of each parameter on the result. I already have a script working with Kendall's tau, and I would like to compare it with Spearman's rho and Pearson's r. An example:
from scipy.stats import spearmanr, kendalltau, pearsonr
result = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
parameter = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
kendalltau(parameter, result)
>> (0.14907119849998596, 0.54850624613917143)
However, if I try the same with spearmanr or pearsonr, I get errors. Apparently this feature is not implemented in SciPy. Do you know of a simple way to obtain correlation coefficients for categorical data?
Recommended answer
Actually, spearmanr works. pearsonr does not, because it needs to calculate the mean of each array, and a string dtype does not support that. See below:
from scipy.stats import spearmanr, kendalltau, pearsonr
result = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
parameter = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
spearmanr(result, parameter)
(0.17407765595569782, 0.63053607555697644)
help(pearsonr)
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
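One way to see why the rank-based coefficient tolerates string labels while Pearson's r does not: Spearman's rho is Pearson's r computed on ranks, and ranking only needs the values to be sortable. The following is a minimal pure-Python sketch of that idea, not SciPy's actual implementation; the helper names pearson, average_ranks and spearman are hypothetical and exist only for this illustration. It also assumes that the alphabetical order of the labels is the order you intend for the categories.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r for two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


result = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
parameter = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']

# Strings can be ranked (sorted), so no numeric mean of the labels is needed.
rho = spearman(result, parameter)
print(round(rho, 6))
```

The value agrees with the spearmanr output shown earlier. Calling pearson(result, parameter) directly would fail, because the mean of a list of strings is undefined.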
For pearsonr, convert the categories to numbers first, for example 'A' to 1 and 'B' to 2:
params = [1 if el == 'A' else 2 for el in parameter]
print(params)
[1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
pearsonr(params, result)
(-0.012995783552244984, 0.97157652425566488)
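The same conversion can be written so that it works for any set of labels and feeds all three coefficients at once. This is only a sketch under stated assumptions: the codes dictionary is a hypothetical helper, the integer codes follow the sorted order of the labels, and the result objects are assumed to unpack into a (statistic, p-value) pair, which holds for both the older tuple returns and the newer result objects in SciPy.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

result = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
parameter = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']

# Hypothetical helper: map each label to an integer code in sorted order
# ('A' -> 1, 'B' -> 2 for this data).
codes = {label: i for i, label in enumerate(sorted(set(parameter)), start=1)}
params = [codes[label] for label in parameter]

stats = {}
for name, func in [("pearson", pearsonr),
                   ("spearman", spearmanr),
                   ("kendall", kendalltau)]:
    # Each function returns the coefficient and a two-sided p-value.
    stat, pvalue = func(params, result)
    stats[name] = stat
    print(f"{name}: statistic={stat:.4f}, p-value={pvalue:.4f}")
```

With only two categories, Pearson's r on the integer codes is the point-biserial correlation, so the choice of codes affects only the sign. With three or more unordered categories the codes impose an arbitrary order, and none of the three coefficients is meaningful unless that order reflects something real.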
Hope that helps.