Fastest way to calculate difference in all columns

Question

I have a dataframe of all float columns. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
#    A    B     C     D
# 0  0.0  1.0   2.0   3.0
# 1  4.0  5.0   6.0   7.0
# 2  8.0  9.0  10.0  11.0

I would like to calculate column-wise differences for all combinations of columns (e.g., A-B, A-C, B-C, etc.).

E.g., the desired output would be something like:

 A_B   A_C   A_D   B_C   B_D   C_D
-1.0  -2.0  -3.0  -1.0  -2.0  -1.0
-1.0  -2.0  -3.0  -1.0  -2.0  -1.0
-1.0  -2.0  -3.0  -1.0  -2.0  -1.0

Since the number of columns may be large, I'd like to do the calculations as efficiently/quickly as possible. I assume I'll get a big speed boost by converting the dataframe to a numpy array first, so I'll do that, but I'm wondering if there are any other strategies that might result in large performance gains. Maybe there is some matrix algebra or multidimensional data format trick that avoids looping through all unique combinations. Any suggestions are welcome. This project is in Python 3.

Recommended answer

Listed in this post are two NumPy approaches for performance: one is fully vectorized and the other uses a single loop.

Approach #1

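def numpy_triu1(df):
    a = df.values
    # Index pairs (r[k], c[k]) of the strict upper triangle, i.e. every i < j
    r, c = np.triu_indices(a.shape[1], 1)
    cols = df.columns
    # Build names such as "A_B"; the f-string also works for non-string labels
    nm = [f"{cols[i]}_{cols[j]}" for i, j in zip(r, c)]
    # Fancy indexing gathers all left-hand and right-hand columns at once
    return pd.DataFrame(a[:, r] - a[:, c], columns=nm)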

Sample run -

In [72]: df
Out[72]: 
     A    B     C     D
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

In [78]: numpy_triu1(df)
Out[78]: 
   A_B  A_C  A_D  B_C  B_D  C_D
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
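
To see why a single subtraction covers every pair, it may help to look at what `np.triu_indices` returns for the four-column example. The snippet below is only an illustrative sketch: it rebuilds the sample array directly with NumPy rather than going through the dataframe.

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)   # same values as the sample dataframe
n = a.shape[1]

# k=1 skips the diagonal, so only pairs with i < j are returned,
# in row-major order: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
r, c = np.triu_indices(n, 1)
pairs = list(zip(r.tolist(), c.tolist()))

# a[:, r] and a[:, c] are both (3, 6) arrays, so one subtraction
# yields all n*(n-1)/2 column differences
diffs = a[:, r] - a[:, c]
print(pairs)
print(diffs.shape)   # (3, 6)
```

The pair order matches the A_B, A_C, A_D, B_C, B_D, C_D column order in the sample run above.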

Approach #2

If we are okay with an array as output, or a dataframe without specialized column names, here's another -

def pairwise_col_diffs(a): # a would be df.values
    n = a.shape[1]
    N = n*(n-1)//2
    idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
    start, stop = idx[:-1], idx[1:]
    out = np.empty((a.shape[0],N),dtype=a.dtype)
    for j,i in enumerate(range(n-1)):
        out[:, start[j]:stop[j]] = a[:,i,None] - a[:,i+1:]
    return out
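
The `start`/`stop` offsets are the least obvious part of this approach. The following sketch reruns the same slicing logic on the 3x4 example and compares the result with the `np.triu_indices` formulation. It is a check under the assumption that both are meant to produce the same column order, not a replacement for the function above.

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)
n = a.shape[1]

# Column i contributes n-1-i differences, so the block widths are
# n-1, n-2, ..., 1 and their cumulative sums are the slice boundaries
idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))
start, stop = idx[:-1], idx[1:]

out = np.empty((a.shape[0], n * (n - 1) // 2), dtype=a.dtype)
for i in range(n - 1):
    # a[:, i, None] has shape (rows, 1) and broadcasts against the
    # remaining columns a[:, i+1:]
    out[:, start[i]:stop[i]] = a[:, i, None] - a[:, i + 1:]

r, c = np.triu_indices(n, 1)
same = np.array_equal(out, a[:, r] - a[:, c])
print(idx.tolist())   # [0, 3, 5, 6]
print(same)           # True
```

Each loop iteration writes one contiguous block of the output, which avoids building the two large gathered arrays that fancy indexing creates.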


Runtime test

Since the OP has mentioned that multi-dimensional array output would work for them as well, here are the array-based approaches from the other authors -

import itertools

# @Allen's soln
def Allen(arr):
    n = arr.shape[1]
    idx = np.asarray(list(itertools.combinations(range(n),2))).T
    return arr[:,idx[0]]-arr[:,idx[1]]

# @DYZ's soln
def DYZ(arr):
    result = np.concatenate([(arr.T - arr.T[x])[x+1:] \
            for x in range(arr.shape[1])]).T
    return result
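
Before comparing timings, it is worth confirming that the functions agree on their output. The sketch below copies the two functions as posted and runs them on the 3x4 example. On that input the DYZ version returns the later column minus the earlier one, so it is the negation of the A-B style differences produced by the other approaches.

```python
import itertools
import numpy as np

def Allen(arr):
    n = arr.shape[1]
    idx = np.asarray(list(itertools.combinations(range(n), 2))).T
    return arr[:, idx[0]] - arr[:, idx[1]]

def DYZ(arr):
    # (arr.T - arr.T[x])[x+1:] holds column j minus column x for every j > x
    result = np.concatenate([(arr.T - arr.T[x])[x + 1:]
                             for x in range(arr.shape[1])]).T
    return result

a = np.arange(12.0).reshape(3, 4)
res_allen = Allen(a)
res_dyz = DYZ(a)

print(res_allen.shape == res_dyz.shape)       # True
print(np.array_equal(res_dyz, -res_allen))    # True
```

The pair ordering is the same, so negating one result makes the two directly comparable. The sign flip should not materially affect the timings.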

The pandas-based solution from @Gerges Dib's post wasn't included, as it came out very slow compared to the others.

Timings -

We will use three dataset sizes, with 100, 500 and 1000 columns (3 rows each):

In [118]: df = pd.DataFrame(np.random.randint(0,9,(3,100)))
     ...: a = df.values
     ...: 

In [119]: %timeit DYZ(a)
     ...: %timeit Allen(a)
     ...: %timeit pairwise_col_diffs(a)
     ...: 
1000 loops, best of 3: 258 µs per loop
1000 loops, best of 3: 1.48 ms per loop
1000 loops, best of 3: 284 µs per loop

In [121]: df = pd.DataFrame(np.random.randint(0,9,(3,500)))
     ...: a = df.values
     ...: 

In [122]: %timeit DYZ(a)
     ...: %timeit Allen(a)
     ...: %timeit pairwise_col_diffs(a)
     ...: 
100 loops, best of 3: 2.56 ms per loop
10 loops, best of 3: 39.9 ms per loop
1000 loops, best of 3: 1.82 ms per loop

In [123]: df = pd.DataFrame(np.random.randint(0,9,(3,1000)))
     ...: a = df.values
     ...: 

In [124]: %timeit DYZ(a)
     ...: %timeit Allen(a)
     ...: %timeit pairwise_col_diffs(a)
     ...: 
100 loops, best of 3: 8.61 ms per loop
10 loops, best of 3: 167 ms per loop
100 loops, best of 3: 5.09 ms per loop
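
The `%timeit` magic is only available in IPython. For a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `best_time` and its `repeat`/`number` defaults are assumptions made for this illustration, and the measured values will vary by machine and NumPy version.

```python
import timeit
import numpy as np

def pairwise_col_diffs(a):
    # Same logic as Approach #2, repeated so this block runs on its own
    n = a.shape[1]
    idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))
    start, stop = idx[:-1], idx[1:]
    out = np.empty((a.shape[0], n * (n - 1) // 2), dtype=a.dtype)
    for i in range(n - 1):
        out[:, start[i]:stop[i]] = a[:, i, None] - a[:, i + 1:]
    return out

def best_time(func, arr, repeat=3, number=10):
    # Hypothetical helper: best per-call time in seconds over `repeat` runs
    runs = timeit.repeat(lambda: func(arr), repeat=repeat, number=number)
    return min(runs) / number

for n_cols in (100, 500, 1000):
    a = np.random.randint(0, 9, (3, n_cols))
    print(n_cols, f"{best_time(pairwise_col_diffs, a) * 1e3:.3f} ms")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that `%timeit` reports in the session above.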
