Speedy complex Pandas & dictionary manipulation
Question
I am new to Pandas and seeking some advice on a tricky pivot table manipulation please.
I have two Pandas pivot tables and a dictionary. The first pivot table has some values that are zero. The second pivot table has the same factors and levels but different values. The dictionary is a set of all possible level pairs for each factor. Sample code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'MyColumn1': ['A', 'A', 'B', 'B'],
                   'MyColumn2': ['M', 'N', 'M', 'P'],
                   'Value': [1, 1, 1, 1]})
# dropna=False keeps the unobserved level pairs; fill_value=0 sets them to zero
table = pd.pivot_table(df, values='Value', index=['MyColumn1', 'MyColumn2'],
                       aggfunc=np.sum, fill_value=0, dropna=False)

df2 = pd.DataFrame({'MyColumn1': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'MyColumn2': ['M', 'N', 'P', 'M', 'N', 'P'],
                    'Value': [5, 10, 15, 20, 25, 30]})
table2 = pd.pivot_table(df2, values='Value', index=['MyColumn1', 'MyColumn2'],
                        aggfunc=np.sum)

# for each factor: (level of one entry, level of the other entry) -> number
myDictionary = {'MyColumn1': {('A', 'A'): 10, ('A', 'B'): 20,
                              ('B', 'A'): 30, ('B', 'B'): 40},
                'MyColumn2': {('M', 'M'): 1, ('M', 'N'): 2, ('M', 'P'): 3,
                              ('N', 'M'): 4, ('N', 'N'): 5, ('N', 'P'): 6,
                              ('P', 'M'): 7, ('P', 'N'): 8, ('P', 'P'): 9}}
This code generates the following tables:
TABLE
                     Value
MyColumn1 MyColumn2
A         M              1
          N              1
          P              0
B         M              1
          N              0
          P              1

TABLE2
                     Value
MyColumn1 MyColumn2
A         M              5
          N             10
          P             15
B         M             20
          N             25
          P             30
For every non-zero entry in Table, I'd like to (1) go through every zero entry in Table and find the product of the numbers in myDictionary that correspond to the zero entry key and the non-zero entry key, (2) multiply each product by the corresponding value in Table2 for the zero'd row, and then (3) find the sum of these products.
For instance, with the data above, the algorithm would calculate:
(A, P) -> (A, M) = 10 * 7 = 70, 70 * 15 = 1050
(A, P) -> (A, N) = 10 * 8 = 80, 80 * 15 = 1200
(A, P) -> (B, M) = 20 * 7 = 140, 140 * 15 = 2100
(A, P) -> (B, P) = 20 * 9 = 180, 180 * 15 = 2700
(B, N) -> (A, M) = 30 * 4 = 120, 120 * 25 = 3000
(B, N) -> (A, N) = 30 * 5 = 150, 150 * 25 = 3750
(B, N) -> (B, M) = 40 * 4 = 160, 160 * 25 = 4000
(B, N) -> (B, P) = 40 * 6 = 240, 240 * 25 = 6000
The first four rows correspond to the first zero entry in Table, and the second four rows correspond to the second zero entry. Each group has four rows because there are four non-zero entries. Next, the algorithm multiplies by the corresponding value in Table2. Finally, it would add up for each non-zero entry:
(A, M) total = 1050 + 3000 = 4050
(A, N) total = 1200 + 3750 = 4950
(B, M) total = 2100 + 4000 = 6100
(B, P) total = 2700 + 6000 = 8700
The desired result is a pivot table of the same shape as the original table but with these sums as the values.
                     Value
MyColumn1 MyColumn2
A         M           4050
          N           4950
          P              0
B         M           6100
          N              0
          P           8700
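For reference, steps (1) to (3) can be spelled out as plain nested loops. This is only a sketch for checking results on small inputs, not a fast solution: the function name brute_force is made up for illustration, and the two tables are built directly from a MultiIndex here instead of going through pivot_table.

```python
import pandas as pd

# Example data from the question, built directly (an assumption made to
# keep the sketch self-contained; pivot_table is not needed for this).
index = pd.MultiIndex.from_product([['A', 'B'], ['M', 'N', 'P']],
                                   names=['MyColumn1', 'MyColumn2'])
table = pd.DataFrame({'Value': [1, 1, 0, 1, 0, 1]}, index=index)
table2 = pd.DataFrame({'Value': [5, 10, 15, 20, 25, 30]}, index=index)
myDictionary = {'MyColumn1': {('A', 'A'): 10, ('A', 'B'): 20,
                              ('B', 'A'): 30, ('B', 'B'): 40},
                'MyColumn2': {('M', 'M'): 1, ('M', 'N'): 2, ('M', 'P'): 3,
                              ('N', 'M'): 4, ('N', 'N'): 5, ('N', 'P'): 6,
                              ('P', 'M'): 7, ('P', 'N'): 8, ('P', 'P'): 9}}


def brute_force(table, table2, pair_weights, col='Value'):
    """Hypothetical reference implementation using explicit loops."""
    factors = list(table.index.names)
    zero_keys = table.index[table[col] == 0]
    nonzero_keys = table.index[table[col] != 0]
    totals = []
    for nz in nonzero_keys:              # one total per non-zero entry
        total = 0
        for z in zero_keys:              # step (1): visit every zero entry
            weight = 1
            for factor, z_level, nz_level in zip(factors, z, nz):
                # dictionary key is (zero-entry level, non-zero-entry level)
                weight *= pair_weights[factor][(z_level, nz_level)]
            # step (2): scale by the Table2 value of the zero row
            total += weight * table2.loc[z, col]
        totals.append(total)             # step (3): sum over zero entries
    result = pd.Series(totals, index=nonzero_keys, name=col)
    # zero entries keep the value 0 in the output
    return result.reindex(table.index, fill_value=0).to_frame()


expected = brute_force(table, table2, myDictionary)
print(expected)
```

The loop count is (non-zero entries) x (zero entries) x (factors), so this is useful mainly as a correctness check for a vectorised version.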
I am looking for a way to do this efficiently because, in practice, I will have 100,000+ zero entries, <1000 non-zero entries, and ~10 columns. I'm not sure how to structure the loops for performance reasons, or even if loops are the most efficient way to do this. Any help would be much appreciated, thank you :)
Recommended answer
Not sure how fast this is on your actual data, but this is what I would do:
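# uses table, table2 and myDictionary from the question (numpy as np, pandas as pd)
# one lookup matrix per factor: rows = zero-entry level, columns = non-zero-entry level
col1_df = pd.Series(myDictionary['MyColumn1']).unstack()
col2_df = pd.Series(myDictionary['MyColumn2']).unstack()

def pair_matrix(lookup, row_labels, col_labels):
    # label-based replacement for DataFrame.lookup, which was removed in pandas 2.0
    rows = lookup.index.get_indexer(row_labels)
    cols = lookup.columns.get_indexer(col_labels)
    return lookup.to_numpy()[np.ix_(rows, cols)]

out_df = pd.DataFrame(index=table.index)
# loop through the value columns
for col in table.columns:
    zeros = table[col].eq(0)
    zero_idx, non_zero_idx = table.index[zeros], table.index[~zeros]
    # shape (num_z, num_nz): one row per zero entry, one column per non-zero entry
    col1 = pair_matrix(col1_df, zero_idx.get_level_values(0), non_zero_idx.get_level_values(0))
    col2 = pair_matrix(col2_df, zero_idx.get_level_values(1), non_zero_idx.get_level_values(1))
    prods = col1 * col2
    # table2 values of the zero rows, as a column vector for broadcasting
    values = table2.loc[zeros, [col]].to_numpy()
    out_df[col] = (pd.Series((prods * values).sum(axis=0), index=non_zero_idx)
                   .reindex(table.index, fill_value=0)
                   )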
Output:
                     Value
MyColumn1 MyColumn2
A         M           4050
          N           4950
          P              0
B         M           6100
          N              0
          P           8700
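Since the real data has around 10 factor columns, the same idea can be written to loop over the index levels instead of hard-coding two of them. The following is a sketch under assumptions, not a tested solution for the full-size problem: the function name weighted_sums is hypothetical, both tables are assumed to share the same MultiIndex, and every level pair is assumed to be present in the dictionary. Steps (2) and (3) are folded into a single vector-matrix product.

```python
import numpy as np
import pandas as pd


def weighted_sums(table, table2, pair_weights, col='Value'):
    """Hypothetical generalisation to any number of index levels."""
    zeros = table[col].eq(0).to_numpy()
    z_index, nz_index = table.index[zeros], table.index[~zeros]
    # weights[i, j] = product over factors of the dictionary number for
    # (level of zero entry i, level of non-zero entry j)
    weights = np.ones((len(z_index), len(nz_index)))
    for level, factor in enumerate(table.index.names):
        lookup = pd.Series(pair_weights[factor]).unstack()
        rows = lookup.index.get_indexer(z_index.get_level_values(level))
        cols = lookup.columns.get_indexer(nz_index.get_level_values(level))
        if (rows < 0).any() or (cols < 0).any():
            raise KeyError(f'level missing from dictionary for {factor!r}')
        weights *= lookup.to_numpy()[np.ix_(rows, cols)]
    # Table2 values of the zero rows, aligned by label rather than position
    z_values = table2[col].reindex(z_index).to_numpy()
    totals = z_values @ weights          # multiply and sum in one step
    return (pd.Series(totals, index=nz_index, name=col)
              .reindex(table.index, fill_value=0)
              .to_frame())


# Example data from the question, built directly for a self-contained check.
index = pd.MultiIndex.from_product([['A', 'B'], ['M', 'N', 'P']],
                                   names=['MyColumn1', 'MyColumn2'])
table = pd.DataFrame({'Value': [1, 1, 0, 1, 0, 1]}, index=index)
table2 = pd.DataFrame({'Value': [5, 10, 15, 20, 25, 30]}, index=index)
myDictionary = {'MyColumn1': {('A', 'A'): 10, ('A', 'B'): 20,
                              ('B', 'A'): 30, ('B', 'B'): 40},
                'MyColumn2': {('M', 'M'): 1, ('M', 'N'): 2, ('M', 'P'): 3,
                              ('N', 'M'): 4, ('N', 'N'): 5, ('N', 'P'): 6,
                              ('P', 'M'): 7, ('P', 'N'): 8, ('P', 'P'): 9}}

result = weighted_sums(table, table2, myDictionary)
print(result)
```

With 100,000 zero entries and 1,000 non-zero entries, the weights matrix holds 10^8 float64 values, which is about 800 MB. If that is too much, one option is to process the zero entries in chunks and add up the partial totals.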