How to do Pearson correlation of selected columns of a Pandas data frame
Question
I have a CSV that looks like this:
gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
As a data frame it looks like this:
In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
  gene  stem1  stem2  stem3  b1  b2  b3  special_col
0  foo     20     10     11  23  22  79            3
1  bar     17     13    505  12  13  88            1
2  qui     17     13      5  12  13  88            3
What I want to do is compute the Pearson correlation of the last column (special_col) with every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].
In the end we should have a data frame of length 6:
Coln PearCorr
stem1 0.5
stem2 -0.5
stem3 -0.9999453506011533
b1 0.5
b2 0.5
b3 -0.5
The values above were computed manually, for example:
In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)
How can I do this?
Answer
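Note that there is a mistake in your data: special_col is all 3s, so no correlation can be computed (a constant column has zero variance, which leaves the Pearson coefficient undefined). The output below matches the values 3, 1, 3 shown in the question.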
Here data is the data frame (called df in the question). If you remove the column selection at the end, you get the correlation matrix of all the columns being analysed. The trailing [:-1] drops the correlation of 'special_col' with itself.
In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
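An alternative that avoids building the full correlation matrix is DataFrame.corrwith, which correlates each column with a single Series. The following is a minimal sketch, not part of the original answer; it assumes the CSV content shown in the question (embedded inline here instead of being fetched from the dpaste URL), and the variable names are illustrative.

```python
import io

import pandas as pd

# CSV content copied from the question; embedded inline as an assumption
# so the example does not depend on the external URL.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns between 'gene' and 'special_col'.
features = df.drop(columns=["gene", "special_col"])

# Pearson correlation (the default method) of each column with special_col.
result = features.corrwith(df["special_col"])
print(result)
```

The result is a Series indexed by column name, so there is no self-correlation entry to slice off.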
If you are interested in speed, this is slightly faster on my machine:
In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]:
array([ 0.5 , -0.5 , -0.99994535, 0.5 , 0.5 ,
-0.5 ])
In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 µs per loop
In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 µs per loop
Note, however, that it returns a NumPy array, not a pandas Series/DataFrame.
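If the labelled result from the question is wanted, the array can be wrapped back into pandas objects. This is a rough sketch under the same assumption as before (the question's CSV embedded inline); the names Coln and PearCorr come from the desired output in the question, while the other variable names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# CSV content copied from the question (inline, as an assumption).
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))

# All columns except 'gene'; special_col is the last one.
numeric = data[data.columns[1:]]

# np.corrcoef treats each row as one variable, hence the transpose.
# The last row holds special_col's correlations; [:-1] drops the self-correlation.
coeffs = np.corrcoef(numeric.to_numpy().T)[-1][:-1]

# Re-attach the column names to get a labelled Series, then a two-column frame.
pear = pd.Series(coeffs, index=numeric.columns[:-1], name="PearCorr")
pear.index.name = "Coln"
result_df = pear.reset_index()
print(result_df)
```

The extra wrapping costs a little time, so whether this still beats DataFrame.corr() is worth measuring on your own data.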