Difference between every row and column in two DataFrames (Python / Pandas)

Question

Is there a more efficient way to compare every column in every row in one DF to every column in every row of another DF? This feels sloppy to me, but my loop / apply attempts have been much slower.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': np.random.randn(1000),
                    'b': [1, 2] * 500,
                    'c': np.random.randn(1000)},
                   index=pd.date_range('1/1/2000', periods=1000))
df2 = pd.DataFrame({'a': np.random.randn(100),
                    'b': [2, 1] * 50,
                    'c': np.random.randn(100)},
                   index=pd.date_range('1/1/2000', periods=100))
df1 = df1.reset_index()
df1['embarrassingHackInd'] = 0
df1.set_index('embarrassingHackInd', inplace=True)
df1.rename(columns={'index': 'origIndex'}, inplace=True)
df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9
df1['df2Date'] = 0
df2 = df2.reset_index()
df2['embarrassingHackInd'] = 0
df2.set_index('embarrassingHackInd', inplace=True)
df2.rename(columns={'index': 'origIndex'}, inplace=True)
df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
df2['df1Date'] = 0

%timeit df3 = abs(df1 - df2)

10 loops, best of 3: 60.6 ms per loop

I need to know which comparison was made, thus the ugly addition of each opposing index to the comparison DF so that it will end up in the final DF.

Thanks in advance for any help.

Answer

The code you posted shows a clever way to produce a subtraction table. However, it doesn't play to Pandas' strengths. Pandas DataFrames store the underlying data in column-based blocks, so data retrieval is fastest when done by column, not by row. Since all the rows have the same index, the subtractions are performed by row (pairing each row with every other row), which means there is a lot of row-based data retrieval going on in df1-df2. That's not ideal for Pandas, particularly when not all the columns have the same dtype.
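
If you want to see the column-versus-row access cost for yourself, here is a minimal timing sketch (my own addition, not part of the original answer); the exact numbers will vary with your machine and pandas version:

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': [1, 2] * 500,
                   'c': np.random.randn(1000)})

# Pulling a whole column grabs one contiguous block; pulling a row has to
# gather values across blocks (and mix dtypes), which is typically slower.
print('column:', timeit.timeit(lambda: df['a'], number=10000))
print('row:   ', timeit.timeit(lambda: df.iloc[500], number=10000))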

Subtraction tables are something NumPy is good at:

In [5]: x = np.arange(10)

In [6]: y = np.arange(5)

In [7]: x[:, np.newaxis] - y
Out[7]: 
array([[ 0, -1, -2, -3, -4],
       [ 1,  0, -1, -2, -3],
       [ 2,  1,  0, -1, -2],
       [ 3,  2,  1,  0, -1],
       [ 4,  3,  2,  1,  0],
       [ 5,  4,  3,  2,  1],
       [ 6,  5,  4,  3,  2],
       [ 7,  6,  5,  4,  3],
       [ 8,  7,  6,  5,  4],
       [ 9,  8,  7,  6,  5]])

You can think of x as one column of df1, and y as one column of df2. You'll see below that NumPy can handle all the columns of df1 and all the columns of df2 in basically the same way, using basically the same syntax.
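
For example, taking just the 'a' columns of the df1 and df2 defined in the question (a small sketch of my own, not from the answer), the single-column version of the idea looks like this:

a1 = df1['a'].values                     # shape (1000,), plays the role of x
a2 = df2['a'].values                     # shape (100,),  plays the role of y
diff_a = np.abs(a1[:, np.newaxis] - a2)  # shape (1000, 100): |a1[i] - a2[j]|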

The code below defines orig and using_numpy. orig is the code you posted; using_numpy is an alternative method which performs the subtraction using NumPy arrays:

In [2]: %timeit orig(df1.copy(), df2.copy())
10 loops, best of 3: 96.1 ms per loop

In [3]: %timeit using_numpy(df1.copy(), df2.copy())
10 loops, best of 3: 19.9 ms per loop


import numpy as np
import pandas as pd
N = 100
df1 = pd.DataFrame({'a': np.random.randn(10*N),
                    'b': [1, 2] * 5*N,
                    'c': np.random.randn(10*N)},
                   index=pd.date_range('1/1/2000', periods=10*N))
df2 = pd.DataFrame({'a': np.random.randn(N),
                    'b': [2, 1] * (N//2),
                    'c': np.random.randn(N)},
                   index=pd.date_range('1/1/2000', periods=N))

def orig(df1, df2):
    df1 = df1.reset_index() # 312 µs per loop
    df1['embarrassingHackInd'] = 0 # 75.2 µs per loop
    df1.set_index('embarrassingHackInd', inplace=True) # 526 µs per loop
    df1.rename(columns={'index':'origIndex'}, inplace=True) # 209 µs per loop
    df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9 # 23.1 µs per loop
    df1['df2Date'] = 0

    df2 = df2.reset_index()
    df2['embarrassingHackInd'] = 0
    df2.set_index('embarrassingHackInd', inplace=True)
    df2.rename(columns={'index':'origIndex'}, inplace=True)
    df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
    df2['df1Date'] = 0
    df3 = abs(df1-df2) # 88.7 ms per loop  <-- this is the bottleneck
    return df3

def using_numpy(df1, df2):
    df1.index.name = 'origIndex'
    df2.index.name = 'origIndex'
    df1.reset_index(inplace=True) 
    df2.reset_index(inplace=True) 
    df1_date = df1['origIndex']
    df2_date = df2['origIndex']
    df1['origIndex'] = df1_date.astype(np.int64) 
    df2['origIndex'] = df2_date.astype(np.int64) 

    arr1 = df1.values
    arr2 = df2.values
    arr3 = np.abs(arr1[:,np.newaxis,:]-arr2) # 3.32 ms per loop vs 88.7 ms 
    arr3 = arr3.reshape(-1, 4)
    index = pd.MultiIndex.from_product(
        [df1_date, df2_date], names=['df1Date', 'df2Date'])
    result = pd.DataFrame(arr3, index=index, columns=df1.columns)
    # You could stop here, but the rest makes the result more similar to orig
    result.reset_index(inplace=True, drop=False)
    result['df1Date'] = result['df1Date'].astype(np.int64) // 10**9
    result['df2Date'] = result['df2Date'].astype(np.int64) // 10**9
    return result

def is_equal(expected, result):
    expected.reset_index(inplace=True, drop=True)
    result.reset_index(inplace=True, drop=True)

    # expected has dtypes 'O', while result has some float and int dtypes. 
    # Make all the dtypes float for a quick and dirty comparison check
    expected = expected.astype('float')
    result = result.astype('float')
    columns = ['a','b','c','origIndex','df1Date','df2Date']
    return expected[columns].equals(result[columns])

expected = orig(df1.copy(), df2.copy())
result = using_numpy(df1.copy(), df2.copy())
assert is_equal(expected, result)
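
If you then need to pull a single comparison out of result, a hypothetical lookup by the two dates (my own addition, not part of the original answer) could look like this:

# df1Date / df2Date are stored as Unix seconds, so convert the timestamps first
ts1 = pd.Timestamp('2000-01-05').value // 10**9
ts2 = pd.Timestamp('2000-01-02').value // 10**9
row = result[(result['df1Date'] == ts1) & (result['df2Date'] == ts2)]
print(row[['a', 'b', 'c']])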


How x[:, np.newaxis] - y works:

This expression takes advantage of NumPy broadcasting. To understand broadcasting -- and NumPy in general -- it pays to know the shape of the arrays:

In [6]: x.shape
Out[6]: (10,)

In [7]: x[:, np.newaxis].shape
Out[7]: (10, 1)

In [8]: y.shape
Out[8]: (5,)

The [:, np.newaxis] adds a new axis to x on the right, so the shape becomes (10, 1). x[:, np.newaxis] - y is therefore the subtraction of an array of shape (10, 1) and an array of shape (5,).

On the face of it, that doesn't make sense, but NumPy arrays broadcast their shapes according to certain rules to try to make them compatible.

The first rule is that new axes can be added on the left. So an array of shape (5,) can broadcast itself to shape (1, 5).

The next rule is that axes of length 1 can broadcast themselves to arbitrary length: the values in the array are simply repeated as often as needed along the extra dimension(s).
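
If you just want to check what two shapes broadcast to, newer NumPy versions (1.20+, if I recall correctly) ship a small helper; a quick sketch:

import numpy as np

# (5,) first gains a new left axis -> (1, 5); the length-1 axes then stretch,
# so (10, 1) and (5,) broadcast to a common shape of (10, 5).
print(np.broadcast_shapes((10, 1), (5,)))   # (10, 5)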

So when arrays of shape (10, 1) and (1, 5) are put together in a NumPy arithmetic operation, they are both broadcast up to arrays of shape (10, 5):

In [14]: broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)

In [15]: broadcasted_x
Out[15]: 
array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8],
       [9, 9, 9, 9, 9]])

In [16]: broadcasted_y
Out[16]: 
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

So x[:, np.newaxis] - y is equivalent to broadcasted_x - broadcasted_y.
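
A quick sanity check of that equivalence (my own addition, not from the answer):

import numpy as np

x = np.arange(10)
y = np.arange(5)
broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)
assert np.array_equal(x[:, np.newaxis] - y, broadcasted_x - broadcasted_y)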

Now, with this simpler example under our belt, we can look at arr1[:,np.newaxis,:]-arr2.

arr1 has shape (1000, 4) and arr2 has shape (100, 4). We want to subtract the items in the axis of length 4, for each row along the 1000-length axis, and each row along the 100-length axis. In other words, we want the subtraction to form an array of shape (1000, 100, 4).

Importantly, we don't want the 1000-axis to interact with the 100-axis. We want them to be in separate axes.

So if we add an axis to arr1 like this: arr1[:,np.newaxis,:], then its shape becomes

In [22]: arr1[:, np.newaxis, :].shape
Out[22]: (1000, 1, 4)

And now, NumPy broadcasting pumps up both arrays to the common shape of (1000, 100, 4). Voila, a subtraction table.
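
To convince yourself what each entry of the 3D result means, here is a small element-wise check on toy arrays with the same column layout (illustrative only, my own addition; the shapes are smaller than in the answer):

import numpy as np

arr1 = np.random.randn(6, 4)                   # stands in for df1.values, shape (1000, 4)
arr2 = np.random.randn(3, 4)                   # stands in for df2.values, shape (100, 4)
arr3 = np.abs(arr1[:, np.newaxis, :] - arr2)   # shape (6, 3, 4)

# arr3[i, j, :] is the |difference| between row i of arr1 and row j of arr2
i, j = 4, 2
assert np.allclose(arr3[i, j], np.abs(arr1[i] - arr2[j]))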

To massage the values into a 2D DataFrame of shape (1000*100, 4), we can use reshape:

arr3 = arr3.reshape(-1, 4)

The -1 tells NumPy to replace it with whatever positive integer is needed for the reshape to make sense. Since arr3 has 1000*100*4 values, the -1 is replaced with 1000*100. Using -1 is nicer than hard-coding 1000*100, however, since the code keeps working even if we change the number of rows in df1 and df2.
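
Because reshape uses C (row-major) order by default, flat row r of the 2D table corresponds to df1 row r // len(df2) and df2 row r % len(df2). A self-contained toy sketch of recovering that pair (my own addition, not from the answer):

import numpy as np

a1 = np.random.randn(6, 4)                  # toy stand-in for arr1
a2 = np.random.randn(3, 4)                  # toy stand-in for arr2
table = np.abs(a1[:, np.newaxis, :] - a2)   # shape (6, 3, 4)
flat = table.reshape(-1, 4)                 # shape (18, 4)

r = 11                                      # some flat row index
i, j = divmod(r, a2.shape[0])               # i -> row of a1, j -> row of a2
assert np.allclose(flat[r], table[i, j])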
