Fast Spearman correlation between two pandas DataFrames

Question

I want to apply Spearman correlation to two pandas DataFrames with the same number of columns (correlation of each pair of rows).

My objective is to compute the distribution of Spearman correlations between each pair of rows (r, s), where r is a row from the first DataFrame and s is a row from the second DataFrame.

I am aware that similar questions have been answered before (see this). However, this question differs because I want to compare each row of the first DataFrame with ALL the rows of the second. Additionally, this is computationally intensive and takes hours due to the size of my data. I want to parallelize it and possibly rewrite it in order to speed it up.

I tried numba, but unfortunately it fails (similar issue to this) because it does not seem to recognize scipy's spearmanr. My code is the following:

from scipy.stats import spearmanr

def corr(a, b):
    # Spearman correlation of every row of a with every row of b
    dist = []
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            dist.append(spearmanr(a.iloc[i, :], b.iloc[j, :])[0])
    return dist

Recommended answer

NEW ANSWER

from numba import njit
import pandas as pd
import numpy as np

@njit
def mean1(a):
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].mean()
  return b

@njit
def std1(a):
  n = len(a)
  b = np.empty(n)
  for i in range(n):
    b[i] = a[i].std()
  return b


@njit
def c(a, b):
    ''' Correlation '''
    n, k = a.shape
    m, k = b.shape

    mu_a = mean1(a)
    mu_b = mean1(b)
    sig_a = std1(a)
    sig_b = std1(b)

    out = np.empty((n, m))

    for i in range(n):
        for j in range(m):
            out[i, j] = (a[i] - mu_a[i]) @ (b[j] - mu_b[j]) / k / sig_a[i] / sig_b[j]

    return out


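# df_test is the DataFrame under test; rank within each row (ties get average ranks)
r = df_test.rank(axis=1).values
# compare with a tolerance rather than exact float equality
np.allclose(df_test.T.corr('spearman'), c(r, r))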

OLD ANSWER

Doing a Spearman rank correlation is simply doing a correlation of the ranks.
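
As a quick check of that statement, the sketch below compares scipy.stats.spearmanr with a Pearson correlation computed on ranks from scipy.stats.rankdata. The sample vectors are made up purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Made-up sample data, for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
y = rng.standard_normal(10)

# Spearman computed directly ...
rho_direct = spearmanr(x, y)[0]
# ... and as a Pearson correlation of the ranks.
rho_via_ranks = pearsonr(rankdata(x), rankdata(y))[0]

print(np.isclose(rho_direct, rho_via_ranks))
```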

We can leverage argsort to get ranks. Though the argsort of the argsort does give us the ranks, we can limit ourselves to a single sort by slice assigning.

def rank(a):
  i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')

  s = a.argsort(1)
  out = np.empty_like(s)
  out[i, s] = j

  return out
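
To illustrate the single-sort trick, the sketch below repeats rank so it runs on its own and compares it with the double argsort on a small made-up array.

```python
import numpy as np

def rank(a):
    # Same single-sort ranking as above, repeated so this block is self-contained.
    i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')
    s = a.argsort(1)
    out = np.empty_like(s)
    out[i, s] = j  # scatter positions 0..k-1 back to the original column order
    return out

# Made-up data without ties, for illustration only.
a = np.array([[0.3, -1.2, 2.5, 0.0],
              [5.0, 4.0, 3.0, 6.0]])

double_sort = a.argsort(1).argsort(1)  # two sorts per row
single_sort = rank(a)                  # one sort plus a scatter assignment

print(single_sort)
# [[2 0 3 1]
#  [2 1 0 3]]
print(np.array_equal(double_sort, single_sort))  # True
```

Note that these are 0-based ordinal ranks: tied values receive distinct ranks in sort order, whereas pandas' rank averages tied ranks by default, so the two approaches only agree when a row has no ties.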


Correlation

In the case of correlating ranks, the means and standard deviations are all predetermined by the size of the second dimension of the array, because every row of ranks is a permutation of 0, ..., k - 1 (assuming no ties).

You can accomplish this same thing without numba, but I'm assuming you want it.

from numba import njit

@njit
def c(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  out = np.empty((n, m))

  a = a - mu
  b = b - mu

  for i in range(n):
    for j in range(m):
      out[i, j] = a[i] @ b[j] / k / sig ** 2

  return out
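
The constants mu and sig above rely on each row of ranks being a permutation of 0, ..., k - 1. A small sketch, with k chosen arbitrarily, checks the closed forms against NumPy:

```python
import numpy as np

k = 10  # arbitrary row length, for illustration only
ranks = np.random.default_rng(0).permutation(k)  # one row of 0-based ranks

mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5

# Any permutation of 0..k-1 has the same mean and population std (ddof=0).
print(np.isclose(ranks.mean(), mu))
print(np.isclose(ranks.std(), sig))
```

If rows contain ties and average ranks are used instead, the standard deviation is no longer fixed by k, which is why the newer answer computes it per row.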

For posterity, we could avoid the inner loops altogether, but this might run into memory issues.

@njit
def c1(a, b):
  n, k = a.shape
  m, k = b.shape

  mu = (k - 1) / 2
  sig = ((k - 1) * (k + 1) / 12) ** .5

  a = a - mu
  b = b - mu

  return a @ b.T / k / sig ** 2
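
One possible way to soften the memory concern, sketched here in plain NumPy rather than numba, is to multiply a block of rows at a time. The function name c_chunked and the chunk parameter are hypothetical, not part of the original answer. The result is still an n x m array; chunking only bounds the size of the intermediate arrays.

```python
import numpy as np

def c_chunked(a, b, chunk=1024):
    """Rank correlation of every row of `a` with every row of `b`.

    `a` and `b` hold 0-based ranks without ties (as returned by `rank`).
    `chunk` (hypothetical knob) is the number of rows of `a` per product.
    """
    n, k = a.shape
    m = b.shape[0]
    mu = (k - 1) / 2
    var = (k - 1) * (k + 1) / 12

    b0 = (b - mu).T  # centred once, shape (k, m)
    out = np.empty((n, m))
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        out[start:stop] = (a[start:stop] - mu) @ b0 / k / var
    return out

# Made-up rank matrices for a quick self-check.
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 8)).argsort(1).argsort(1)
b = rng.standard_normal((3, 8)).argsort(1).argsort(1)

# Unchunked reference: mean 3.5 and variance 7 * 9 / 12 for k = 8.
full = (a - 3.5) @ (b - 3.5).T / 8 / (7 * 9 / 12)
print(np.allclose(c_chunked(a, b, chunk=2), full))
```

If only the distribution of correlations is needed, each block could be summarised (for example with a histogram) instead of being stored, which would avoid holding the full n x m result.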


Demonstration

np.random.seed([3, 1415])

a = np.random.randn(2, 10)
b = np.random.randn(2, 10)

rank_a = rank(a)
rank_b = rank(b)

c(rank_a, rank_b)

array([[0.32121212, 0.01818182],
       [0.13939394, 0.55151515]])

If you are working with DataFrames

da = pd.DataFrame(a)
db = pd.DataFrame(b)

pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)


          0         1
0  0.321212  0.018182
1  0.139394  0.551515


Validation

We can do a quick validation using pandas.DataFrame.corr

pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)

      0     1
0  True  True
1  True  True
