Speeding up pandas code by replacing iterrows


Question

I have a DataFrame like the one below:

+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   |     1720 |   260 |  1500 |   1 |        0 |      1500 |
| Inv_001   |     1720 |   777 |   100 |   1 |      260 |       100 |
| Inv_001   |     1720 |   888 |   120 |   1 |      260 |       120 |
| Inv_002   |     1160 |   360 |   700 |   1 |        0 |       700 |
| Inv_002   |     1160 |   777 |   100 |   1 |      360 |       100 |
| Inv_002   |     1160 |   888 |   120 |   1 |      360 |       120 |
| Inv_002   |     1160 |   999 |   140 |   1 |      360 |       140 |
| Inv_002   |     1160 |   111 |   100 |   1 |        0 |       100 |
+-----------+----------+-------+-------+-----+----------+-----------+

Within each invoice, I want to add each sub-item's ProdTotal to the row whose Item# equals that sub-item's MainCode, and then drop the sub-item rows. Inspired by the answers I got to an earlier question, I managed to produce the desired output shown below

+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   |     1720 |   260 |  1720 |   1 |        0 |      1720 |
| Inv_002   |     1160 |   360 |  1060 |   1 |        0 |      1060 |
| Inv_002   |     1160 |   111 |   100 |   1 |        0 |       100 |
+-----------+----------+-------+-------+-----+----------+-----------+

using the following code:

import pandas as pd

df = pd.read_csv('data.csv')
df_grouped = dict(tuple(df.groupby(['InvoiceNo'])))

remove_index= []
ids = 0

for x in df_grouped:
    for index, row in df_grouped[x].iterrows():
        ids += 1
        try:
            main_code_data = df_grouped[x].loc[df_grouped[x]['MainCode'] == row['Item#']]
            length = len(main_code_data['Item#'])
            iterator = 0
            index_value = 0    
            for i in range(len(df_grouped[x].index)):
                index_value += df_grouped[x].at[index + iterator, 'ProdTotal']
                df.at[index, 'ProdTotal'] = index_value

                iterator += 1

            for item in main_code_data.index:
                remove_index.append(item)

        except:
            pass

df = df.drop(remove_index)

But the data consists of millions of rows, and this code runs very slowly. From a brief Google search and comments from other members, I learned that iterrows() is what makes the code slow. How can I replace iterrows() to make my code more efficient and more Pythonic?

Recommended answer

This works on the sample data. Does it work on your actual data?

import pandas as pd

# Sample data.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100]
})

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})

result = df[df['MainCode'].eq(0)]
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result['price'] = result.eval('ProdTotal / qty')
result = result.drop(columns=['ProdSubTotal'])

>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0
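
As a quick check against the table from the question (where the main item of Inv_002 is coded 360 rather than 260), the steps above can be wrapped in a function. This is only a minimal sketch: the function name `collapse_sub_items` is made up for illustration, and it assumes that a MainCode of 0 always marks a main item.

```python
import pandas as pd


def collapse_sub_items(df):
    """Fold each sub-item's ProdTotal into its main item (MainCode == 0).

    Hypothetical helper for illustration; assumes MainCode == 0 marks a main item.
    """
    is_main = df['MainCode'].eq(0)

    # Total of the sub-items per invoice and main code.
    subtotals = (
        df[~is_main]
        .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
        .rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
    )

    # Attach each subtotal to its main item; main items without sub-items get NaN.
    result = df[is_main].merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
    result['ProdTotal'] = result['ProdTotal'] + result['ProdSubTotal'].fillna(0)
    result['price'] = result['ProdTotal'] / result['qty']
    return result.drop(columns=['ProdSubTotal'])


# The table from the question.
question_df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 360, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 360, 360, 360, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

result = collapse_sub_items(question_df)
print(result)
```

Because the grouping and the merge both run as single vectorized operations, no Python-level loop over rows is needed.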

We first want the aggregate ProdTotal per InvoiceNo and MainCode, but only for rows where MainCode is not equal to zero (.ne(0)):

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
>>> subtotals
  InvoiceNo  MainCode  ProdTotal
0   Inv_001       260        220
1   Inv_002       260        360

We then need the main items from the original dataframe, so we keep only the rows where MainCode equals zero (.eq(0)):

result = df[df['MainCode'].eq(0)]
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0       1500
3   Inv_002      1160    260    700    1         0        700
7   Inv_002      1160    111    100    1         0        100

We want to join the subtotals to this result where InvoiceNo matches and the Item# in result matches the MainCode in subtotals. One way to do this is to rename the columns in subtotals and then perform a left merge:

subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal  ProdSubTotal
0   Inv_001      1720    260   1500    1         0       1500         220.0
1   Inv_002      1160    260    700    1         0        700         360.0
2   Inv_002      1160    111    100    1         0        100           NaN
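
One thing the left merge does not reveal is a sub-item whose MainCode has no matching main item in the same invoice: its subtotal would simply never reach the result. A possible way to spot such rows is the `indicator` argument of `merge`. This is a sketch only; the small frame below, including the item 555 with MainCode 999, is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'InvoiceNo': ['Inv_001', 'Inv_001', 'Inv_001'],
    'Item#': [260, 777, 555],
    'MainCode': [0, 260, 999],  # 999 matches no main item (made-up orphan)
    'ProdTotal': [1500, 100, 50],
})

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})

# Keys of the main items only.
main_keys = df.loc[df['MainCode'].eq(0), ['InvoiceNo', 'Item#']]

# indicator=True adds a '_merge' column; 'left_only' marks subtotals
# that found no main item to attach to.
check = subtotals.merge(main_keys, on=['InvoiceNo', 'Item#'], how='left', indicator=True)
orphans = check[check['_merge'] == 'left_only']
print(orphans)
```

If `orphans` is empty on the real data, every sub-item amount ends up in some main item's total.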

Now we add ProdSubTotal to ProdTotal and drop the helper column. The fillna(0) covers main items without sub-items, whose ProdSubTotal is NaN after the left merge.

result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result = result.drop(columns=['ProdSubTotal'])
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0     1720.0
1   Inv_002      1160    260    700    1         0     1060.0
2   Inv_002      1160    111    100    1         0      100.0
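
The `.0` in this output appears because the left merge produces NaN for main items without sub-items, which forces the columns to float. If integer totals matter, one possible alternative (a sketch, not part of the original answer) is to look the subtotals up by index with `reindex` and `fill_value=0` instead of merging, which should keep the column as integers.

```python
import pandas as pd

df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100],
})

# Subtotals as a Series indexed by (InvoiceNo, MainCode).
subtotals = df[df['MainCode'].ne(0)].groupby(['InvoiceNo', 'MainCode'])['ProdTotal'].sum()

result = df[df['MainCode'].eq(0)].copy()

# Look up each main item's (InvoiceNo, Item#) pair; pairs without sub-items become 0.
keys = pd.MultiIndex.from_frame(result[['InvoiceNo', 'Item#']])
result['ProdTotal'] += subtotals.reindex(keys, fill_value=0).to_numpy()
print(result)
```

This variant also keeps the original row labels (0, 3, 7) rather than the fresh 0..n-1 index that `merge` returns.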

Finally, we recalculate price from qty and the new ProdTotal.

result['price'] = result.eval('ProdTotal / qty')
>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0
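
In the sample data, totalamt appears to be the invoice total, so the collapsed ProdTotal values can be sanity-checked against it. The following is a sketch under that assumption; the `result` frame is typed in from the output above, and the names `per_invoice`, `prod_sum`, and `mismatches` are illustrative.

```python
import pandas as pd

# Final result from the answer, re-entered by hand for a self-contained check.
result = pd.DataFrame({
    'InvoiceNo': ['Inv_001', 'Inv_002', 'Inv_002'],
    'totalamt': [1720, 1160, 1160],
    'Item#': [260, 260, 111],
    'ProdTotal': [1720.0, 1060.0, 100.0],
})

# One row per invoice: the stated total and the sum of the collapsed ProdTotal values.
per_invoice = result.groupby('InvoiceNo').agg(
    totalamt=('totalamt', 'first'),
    prod_sum=('ProdTotal', 'sum'),
)

# Invoices whose main-item totals do not add up to totalamt.
mismatches = per_invoice[per_invoice['totalamt'] != per_invoice['prod_sum']]
print(mismatches)
```

An empty `mismatches` frame means every invoice adds up; on real data with fractional amounts, a tolerance-based comparison may be more appropriate than exact equality.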
