pandas 将具有多索引和重叠索引级别的数据框相乘 [英] Pandas multiply dataframes with multiindex and overlapping index levels
问题描述
我正在努力完成一项本应很简单的任务,但是它没有按我认为的那样工作.我有两个数字数据框架A和B,分别具有multiindex和下面的列:
I´m struggling with a task that should be simple, but it is not working as I thought it would. I have two numeric dataframes A and B with multiindex and columns below:
A = A B C D
X 1 AX1 BX1 CX1 DX1
2 AX2 BX2 CX2 DX2
3 AX3 BX3 CX3 DX3
Y 1 AY1 BY1 CY1 DY1
2 AY2 BY2 CY2 DY2
3 AY3 BY3 CY3 DY3
B = A B C D
X 1 a AX1a BX1a CX1a DX1a
b AX1b BX1b CX1b DX1b
c AX1c BX1c CX1c DX1c
2 a AX2a BX2a CX2a DX2a
b AX2b BX2b CX2b DX2b
c AX2c BX2c CX2c DX2c
3 a AX3a BX3a CX3a DX3a
b AX3b BX3b CX3b DX3b
c AX3c BX3c CX3c DX3c
Y 1 a AY1a BY1a CY1a DY1a
b AY1b BY1b CY1b DY1b
c AY1c BY1c CY1c DY1c
2 a AY2a BY2a CY2a DY2a
b AY2b BY2b CY2b DY2b
c AY2c BY2c CY2c DY2c
3 a AY3a BY3a CY3a DY3a
b AY3b BY3b CY3b DY3b
c AY3c BY3c CY3c DY3c ## Heading ##
我想将A * B广播乘以B的最内层,我想要下面得到的数据帧R:
I´d like to multiply A * B broadcasting over the innermost level of B, I want the resulting dataframe R, below:
R= A B C D
X 1 a (AX1a * AX1) (BX1a * BX1) (CX1a * CX1) (DX1a * DX1)
b (AX1b * AX1) (BX1b * BX1) (CX1b * CX1) (DX1b * DX1)
c (AX1c * AX1) (BX1c * BX1) (CX1c * CX1) (DX1c * DX1)
2 a (AX2a * AX2) (BX2a * BX2) (CX2a * CX2) (DX2a * DX2)
b (AX2b * AX2) (BX2b * BX2) (CX2b * CX2) (DX2b * DX2)
c (AX2c * AX2) (BX2c * BX2) (CX2c * CX2) (DX2c * DX2)
3 a (AX3a * AX3) (BX3a * BX3) (CX3a * CX3) (DX3a * DX3)
b (AX3b * AX3) (BX3b * BX3) (CX3b * CX3) (DX3b * DX3)
c (AX3c * AX3) (BX3c * BX3) (CX3c * CX3) (DX3c * DX3)
Y 1 a (AY1a * AY1) (BY1a * BY1) (CY1a * CY1) (DY1a * DY1)
b (AY1b * AY1) (BY1b * BY1) (CY1b * CY1) (DY1b * DY1)
c (AY1c * AY1) (BY1c * BY1) (CY1c * CY1) (DY1c * DY1)
2 a (AY2a * AY2) (BY2a * BY2) (CY2a * CY2) (DY2a * DY2)
b (AY2b * AY2) (BY2b * BY2) (CY2b * CY2) (DY2b * DY2)
c (AY2c * AY2) (BY2c * BY2) (CY2c * CY2) (DY2c * DY2)
3 a (AY3a * AY3) (BY3a * BY3) (CY3a * CY3) (DY3a * DY3)
b (AY3b * AY3) (BY3b * BY3) (CY3b * CY3) (DY3b * DY3)
c (AY3c * AY3) (BY3c * BY3) (CY3c * CY3) (DY3c * DY3)
我尝试通过以下方式将带有级别关键字的熊猫乘法功能使用:
I tried using pandas multiply function with level keyword by doing:
b.multiply(a, level=[0,1])
但是会引发错误:"TypeError:两个MultiIndex对象之间的级别上的连接是不明确的"
but it throws an error: "TypeError: Join on level between two MultiIndex objects is ambiguous"
执行此操作的正确方法是什么?
What is the right way of doing this operation?
推荐答案
提议的方法
Proposed approach
We are talking about broadcasting
, thus I would like to bring in NumPy supported broadcasting
here.
解决方案代码如下所示-
The solution code would look something like this -
def numpy_broadcasting(df0, df1):
m,n,r = map(len,df1.index.levels)
a0 = df0.values.reshape(m,n,-1)
a1 = df1.values.reshape(m,n,r,-1)
out = (a1*a0[...,None,:]).reshape(-1,a1.shape[-1])
df_out = pd.DataFrame(out, index=df1.index, columns=df1.columns)
return df_out
基本概念:
1]以多维数组的形式获取数据框中的视图.多维性是根据multindex数据帧的级别结构维护的.因此,第一个数据帧将具有三个级别(包括列),第二个数据帧将具有四个级别.因此,我们有a0
和a1
对应于输入数据帧df0
和df1
,从而导致a0
和a1
的尺寸分别为3
和4
.
1] Get views into the dataframe as multidimensional arrays. The multidimensionality is maintained according to the level structure of the multindex dataframe. Thus, the first dataframe would have three levels (including the columns) and the second one has four levels. Thus, we have a0
and a1
corresponding to the input dataframes df0
and df1
, resulting in a0
and a1
having 3
and 4
dimensions respectively.
2)现在,是广播部分.通过在第三个位置引入新轴,我们简单地将a0
扩展为4个维度.该新轴将与df1
中的第三轴匹配.这使我们能够执行逐元素乘法.
2) Now, comes the broadcasting part. We simply extend a0
to have 4 dimensions by introducing a new axis at the third position. This new axis would match up against the third axis from df1
. This allows us to perform element-wise multiplication.
3)最后,要获得输出multindex数据帧,我们只需对产品进行整形.
3) Finally, to get the output multindex dataframe, we simply reshape the product.
样品运行:
1)输入数据帧-
In [369]: df0
Out[369]:
A B C D
0 0 3 2 2 3
1 6 8 1 0
2 3 5 1 5
1 0 7 0 3 1
1 7 0 4 6
2 2 0 5 0
In [370]: df1
Out[370]:
A B C D
0 0 0 4 6 1 2
1 3 3 4 5
2 8 1 7 4
1 0 7 2 5 4
1 8 6 7 5
2 0 4 7 1
2 0 1 4 2 2
1 2 3 8 1
2 0 0 5 7
1 0 0 8 6 1 7
1 0 6 1 4
2 5 4 7 4
1 0 4 7 0 1
1 4 2 6 8
2 3 1 0 6
2 0 8 4 7 4
1 0 6 2 0
2 7 8 6 1
2)输出数据帧-
In [371]: df_out
Out[371]:
A B C D
0 0 0 12 12 2 6
1 9 6 8 15
2 24 2 14 12
1 0 42 16 5 0
1 48 48 7 0
2 0 32 7 0
2 0 3 20 2 10
1 6 15 8 5
2 0 0 5 35
1 0 0 56 0 3 7
1 0 0 3 4
2 35 0 21 4
1 0 28 0 0 6
1 28 0 24 48
2 21 0 0 36
2 0 16 0 35 0
1 0 0 10 0
2 14 0 30 0
基准化
In [31]: # Setup input dataframes of the same shape as stated in the question
...: individuals = list(range(2))
...: time = (0, 1, 2)
...: index = pd.MultiIndex.from_tuples(list(product(individuals, time)))
...: A = pd.DataFrame(data={'A': np.random.randint(0,9,6), \
...: 'B': np.random.randint(0,9,6), \
...: 'C': np.random.randint(0,9,6), \
...: 'D': np.random.randint(0,9,6)
...: }, index=index)
...:
...:
...: individuals = list(range(2))
...: time = (0, 1, 2)
...: P = (0,1,2)
...: index = pd.MultiIndex.from_tuples(list(product(individuals, time, P)))
...: B = pd.DataFrame(data={'A': np.random.randint(0,9,18), \
...: 'B': np.random.randint(0,9,18), \
...: 'C': np.random.randint(0,9,18), \
...: 'D': np.random.randint(0,9,18)}, index=index)
...:
# @DSM's solution
In [32]: %timeit B * A.loc[B.index.droplevel(2)].set_index(B.index)
1 loops, best of 3: 8.75 ms per loop
# @Nickil Maveli's solution
In [33]: %timeit B.multiply(A.reindex(B.index, method='ffill'))
1000 loops, best of 3: 625 µs per loop
# @root's solution
In [34]: %timeit B * np.repeat(A.values, 3, axis=0)
1000 loops, best of 3: 487 µs per loop
In [35]: %timeit numpy_broadcasting(A, B)
1000 loops, best of 3: 191 µs per loop
这篇关于 pandas 将具有多索引和重叠索引级别的数据框相乘的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!