How to do Pearson correlation of selected columns of a Pandas data frame

Question

I have a CSV that looks like this:

gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3

As a data frame, it looks like this:

In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
  gene  stem1  stem2  stem3  b1  b2  b3  special_col
0  foo     20     10     11  23  22  79            3
1  bar     17     13    505  12  13  88            1
2  qui     17     13      5  12  13  88            3

What I want to do is compute the Pearson correlation of the last column (special_col) with every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].

The result should be a data frame of length 6:

Coln    PearCorr
stem1    0.5
stem2   -0.5
stem3   -0.9999453506011533
b1       0.5
b2       0.5
b3      -0.5

The values above were computed manually:

In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)

How can I do this?

Recommended answer

Note that there is a mistake in your data: if special_col is all 3s, it has zero variance, so no correlation can be computed.
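
To see why a constant column is a problem, here is a minimal sketch (the values are illustrative, taken from the stem3 column and a hypothetical all-3 special_col). Pearson correlation divides by the standard deviation of each input, which is zero for a constant column, so the result is NaN:

```python
import math

import numpy as np
import pandas as pd

constant_col = pd.Series([3, 3, 3])   # hypothetical special_col with no variation
stem3 = pd.Series([11, 505, 5])

# Silence the divide-by-zero style warnings that the zero variance may trigger
with np.errstate(invalid="ignore", divide="ignore"):
    r = constant_col.corr(stem3)      # Pearson is the default method

print(math.isnan(r))
```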

If you remove the column selection at the end, you'll get a correlation matrix of all the columns you are analysing. The trailing [:-1] removes the correlation of 'special_col' with itself.

In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]: 
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64
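
As an alternative that avoids building the full correlation matrix, the following sketch uses `DataFrame.corrwith` and reshapes the result into the two-column layout asked for in the question. It assumes the CSV from the question is embedded inline rather than downloaded; the names `features`, `pear` and `result` are only illustrative.

```python
import io

import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))

# Every column between 'gene' (first) and 'special_col' (last)
features = data[data.columns[1:-1]]

# Correlate each selected column with special_col (Pearson is the default)
pear = features.corrwith(data["special_col"])

# Turn the Series into a data frame with the column names from the question
result = pear.rename_axis("Coln").reset_index(name="PearCorr")
print(result)
```

This keeps the output as a pandas object, and no trailing [:-1] is needed because special_col is never correlated with itself.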

If you are interested in speed, this is slightly faster on my machine:

In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]: 
array([ 0.5       , -0.5       , -0.99994535,  0.5       ,  0.5       ,
       -0.5       ])

In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 µs per loop

In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 µs per loop

But note that it returns a NumPy array, not a pandas Series/DataFrame.
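
If the labels matter, the array can be wrapped back into a Series. This is a small sketch under the same assumption that the question's CSV is embedded inline; `numeric`, `coefs` and `labelled` are illustrative names.

```python
import io

import numpy as np
import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))
numeric = data[data.columns[1:]]          # drop the non-numeric 'gene' column

# np.corrcoef treats each row as one variable, hence the transpose.
# The last row holds special_col's correlations; [:-1] drops its self-correlation.
coefs = np.corrcoef(numeric.to_numpy(dtype=float).T)[-1][:-1]

# Reattach the column names that the plain array lost
labelled = pd.Series(coefs, index=numeric.columns[:-1], name="special_col")
print(labelled)
```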
