Speedy complex Pandas & dictionary manipulation

Question

I am new to Pandas and seeking some advice on a tricky pivot table manipulation please.

I have two Pandas pivot tables and a dictionary. The first pivot table has some values that are zero. The second pivot table has the same factors and levels but different values. The dictionary is a set of all possible level pairs for each factor. Sample code:

import pandas as pd

df = pd.DataFrame({'MyColumn1': ['A', 'A', 'B', 'B'],
                   'MyColumn2': ['M', 'N', 'M', 'P'],
                   'Value': [1, 1, 1, 1]})
# dropna=False keeps the (A, P) and (B, N) combinations that have no data;
# fill_value=0 sets them to zero
table = pd.pivot_table(df, values='Value', index=['MyColumn1', 'MyColumn2'],
                       aggfunc='sum', fill_value=0, dropna=False)

df2 = pd.DataFrame({'MyColumn1': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'MyColumn2': ['M', 'N', 'P', 'M', 'N', 'P'],
                    'Value': [5, 10, 15, 20, 25, 30]})
table2 = pd.pivot_table(df2, values='Value', index=['MyColumn1', 'MyColumn2'],
                        aggfunc='sum')

# For each factor, one number for every ordered pair of levels
myDictionary = {'MyColumn1': {('A', 'A'): 10, ('A', 'B'): 20,
                              ('B', 'A'): 30, ('B', 'B'): 40},
                'MyColumn2': {('M', 'M'): 1, ('M', 'N'): 2, ('M', 'P'): 3,
                              ('N', 'M'): 4, ('N', 'N'): 5, ('N', 'P'): 6,
                              ('P', 'M'): 7, ('P', 'N'): 8, ('P', 'P'): 9}}

This code produces the following tables:

TABLE                Value
MyColumn1 MyColumn2       
A         M              1
          N              1
          P              0
B         M              1
          N              0
          P              1

TABLE2               Value
MyColumn1 MyColumn2       
A         M              5
          N             10
          P             15
B         M             20
          N             25
          P             30

For every non-zero entry in the Table, I'd like to (1) go through every zero entry in the Table and find the product of the numbers in myDictionary that correspond to the zero entry key and the non-zero entry key, (2) multiply each product by the corresponding value in Table2 for the zero'd row, and then (3) find the sum of these products.

For instance, with the data above, the algorithm would calculate:

(A, P) -> (A, M) = 10 * 7 = 70, 70 * 15 = 1050
(A, P) -> (A, N) = 10 * 8 = 80, 80 * 15 = 1200
(A, P) -> (B, M) = 20 * 7 = 140, 140 * 15 = 2100
(A, P) -> (B, P) = 20 * 9 = 180, 180 * 15 = 2700

(B, N) -> (A, M) = 30 * 4 = 120, 120 * 25 = 3000
(B, N) -> (A, N) = 30 * 5 = 150, 150 * 25 = 3750
(B, N) -> (B, M) = 40 * 4 = 160, 160 * 25 = 4000
(B, N) -> (B, P) = 40 * 6 = 240, 240 * 25 = 6000

The first four lines correspond to the first zero entry in the Table, (A, P), and the last four lines to the second zero entry, (B, N). Each block has four lines because there are four non-zero entries. On each line, the dictionary product is then multiplied by the Table2 value for the zero row (15 for (A, P), 25 for (B, N)). Finally, the algorithm would sum the results for each non-zero entry:

(A, M) total = 1050 + 3000 = 4050
(A, N) total = 1200 + 3750 = 4950
(B, M) total = 2100 + 4000 = 6100
(B, P) total = 2700 + 6000 = 8700

The desired result is a pivot table of the same shape as the original table but with these sums as the values.

                     Value
MyColumn1 MyColumn2       
A         M           4050
          N           4950
          P              0
B         M           6100
          N              0
          P           8700
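
For reference, the three steps above can be written as a direct double loop. This is only a minimal, unoptimized sketch for checking faster versions against: the helper name `naive_totals` is hypothetical, the two tables are rebuilt by hand rather than through `pivot_table`, and it assumes every level pair is present in `myDictionary`.

```python
import pandas as pd

# Tables rebuilt by hand (assumption: same values as the pivot tables above)
index = pd.MultiIndex.from_product([["A", "B"], ["M", "N", "P"]],
                                   names=["MyColumn1", "MyColumn2"])
table = pd.DataFrame({"Value": [1, 1, 0, 1, 0, 1]}, index=index)
table2 = pd.DataFrame({"Value": [5, 10, 15, 20, 25, 30]}, index=index)
myDictionary = {"MyColumn1": {("A", "A"): 10, ("A", "B"): 20,
                              ("B", "A"): 30, ("B", "B"): 40},
                "MyColumn2": {("M", "M"): 1, ("M", "N"): 2, ("M", "P"): 3,
                              ("N", "M"): 4, ("N", "N"): 5, ("N", "P"): 6,
                              ("P", "M"): 7, ("P", "N"): 8, ("P", "P"): 9}}


def naive_totals(table, table2, weights, col="Value"):
    """Hypothetical helper: loop over every (non-zero row, zero row) pair."""
    factors = list(table.index.names)
    current = table[col].to_dict()   # {(level1, level2): value}
    scale = table2[col].to_dict()    # Table2 value for each row
    zero_keys = [key for key, value in current.items() if value == 0]

    totals = []
    for key, value in current.items():
        total = 0
        if value != 0:
            for zero_key in zero_keys:
                # (1) product of the dictionary numbers, one per factor,
                #     keyed as (zero level, non-zero level)
                product = 1
                for factor, z_level, nz_level in zip(factors, zero_key, key):
                    product *= weights[factor][(z_level, nz_level)]
                # (2) scale by the Table2 value of the zero row, (3) accumulate
                total += product * scale[zero_key]
        totals.append(total)
    return pd.Series(totals, index=table.index, name=col)


naive_result = naive_totals(table, table2, myDictionary)
print(naive_result)
```

On the sample data this gives 4050, 4950, 0, 6100, 0, 8700, matching the table above. The cost grows with (number of zero rows) × (number of non-zero rows) × (number of factors), which is why a pure Python loop is unlikely to be practical at the sizes described below.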

I am looking for a way to do this efficiently because, in practice, I will have 100,000+ zero entries, <1000 non-zero entries, and ~10 columns. I'm not sure how to structure the loops for performance reasons, or even if loops are the most efficient way to do this. Any help would be much appreciated, thank you :)

Recommended answer

Not sure how fast this is on your actual data, but this is what I would do:

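import numpy as np

# Turn each {(zero level, non-zero level): number} dict into a 2-D table
col1_df = pd.Series(myDictionary['MyColumn1']).unstack()
col2_df = pd.Series(myDictionary['MyColumn2']).unstack()

def lookup(df, rows, cols):
    # Pairwise label lookup; replaces DataFrame.lookup, removed in pandas 2.0
    return df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]

out_df = pd.DataFrame(index=table.index)

# loop through the value columns
for col in table.columns:
    zeros = table[col].eq(0)
    non_zero_labels = table.index[~zeros]
    non_zero_idx = np.array(non_zero_labels.to_list())
    zero_idx = np.array(table.index[zeros].to_list())

    num_nz, num_z = len(non_zero_idx), len(zero_idx)

    # positions of every (zero row, non-zero row) pair
    xs, ys = np.meshgrid(np.arange(num_z), np.arange(num_nz))
    xs, ys = xs.ravel(), ys.ravel()

    col1 = lookup(col1_df, zero_idx[xs, 0], non_zero_idx[ys, 0])
    col2 = lookup(col2_df, zero_idx[xs, 1], non_zero_idx[ys, 1])

    # prods[i, j]: dictionary product for zero row i and non-zero row j
    prods = (col1 * col2).reshape(num_nz, num_z).T

    # Table2 values of the zero rows, as a column vector
    values = table2.loc[zeros, [col]].values

    out_df[col] = (pd.Series((prods * values).sum(0), index=non_zero_labels)
                     .reindex(table.index, fill_value=0)
                  )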

Output:

                     Value
MyColumn1 MyColumn2       
A         M           4050
          N           4950
          P              0
B         M           6100
          N              0
          P           8700
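
As an alternative sketch (a swapped-in variation, not the method posted above), the pairwise products can be built with NumPy broadcasting instead of `meshgrid`, looping over however many index levels the table has. The helper names `pairwise_weights` and `weighted_totals` are hypothetical, and the tables are rebuilt by hand here. It assumes `table2` has a row for every row of `table` and that each dictionary contains every level pair.

```python
import numpy as np
import pandas as pd

# Tables rebuilt by hand (assumption: same values as the question's pivot tables)
index = pd.MultiIndex.from_product([["A", "B"], ["M", "N", "P"]],
                                   names=["MyColumn1", "MyColumn2"])
table = pd.DataFrame({"Value": [1, 1, 0, 1, 0, 1]}, index=index)
table2 = pd.DataFrame({"Value": [5, 10, 15, 20, 25, 30]}, index=index)
myDictionary = {"MyColumn1": {("A", "A"): 10, ("A", "B"): 20,
                              ("B", "A"): 30, ("B", "B"): 40},
                "MyColumn2": {("M", "M"): 1, ("M", "N"): 2, ("M", "P"): 3,
                              ("N", "M"): 4, ("N", "N"): 5, ("N", "P"): 6,
                              ("P", "M"): 7, ("P", "N"): 8, ("P", "P"): 9}}


def pairwise_weights(weights, factor, zero_levels, non_zero_levels):
    """Matrix w[i, j] = weights[factor][(zero_levels[i], non_zero_levels[j])]."""
    lookup = pd.Series(weights[factor]).unstack()
    rows = lookup.index.get_indexer(zero_levels)
    cols = lookup.columns.get_indexer(non_zero_levels)
    # Broadcasting a column of row positions against a row of column positions
    return lookup.to_numpy()[rows[:, None], cols[None, :]]


def weighted_totals(table, table2, weights, col="Value"):
    """Hypothetical helper: per non-zero row, sum over zero rows of product * Table2."""
    zeros = table[col].eq(0).to_numpy()
    zero_idx, non_zero_idx = table.index[zeros], table.index[~zeros]

    # Multiply the per-factor matrices elementwise, one index level at a time
    prods = np.ones((len(zero_idx), len(non_zero_idx)), dtype=np.int64)
    for level, factor in enumerate(table.index.names):
        prods = prods * pairwise_weights(weights, factor,
                                         zero_idx.get_level_values(level),
                                         non_zero_idx.get_level_values(level))

    # Table2 values of the zero rows, aligned to table's row order
    zero_values = table2[col].reindex(table.index).to_numpy()[zeros]
    totals = zero_values @ prods  # weighted sum over the zero rows
    return (pd.Series(totals, index=non_zero_idx, name=col)
              .reindex(table.index, fill_value=0))


result = weighted_totals(table, table2, myDictionary)
print(result)
```

On the sample data this yields the same values as the output above. With 100,000 zero rows and 1,000 non-zero rows the dense matrix holds 10^8 entries (about 800 MB as 64-bit values), so it may be necessary to process the zero rows in chunks and accumulate the partial sums.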
