Iterrows performance

Question

I'm working with Python 2.7 and pandas (version 0.18.1) data frames. I have to modify one column in a data frame based on several other columns in the same data frame.

To do that I wrote the code further down. data is my data frame, and the sample data looks like this:

+---+---+----+----+---+---------+---+----+----+---+----------+
| a | b | c  | d  | e |    f    | g | h  | i  | j | discount |
+---+---+----+----+---+---------+---+----+----+---+----------+
| 0 |   |    |    |   | 65497.6 |   |    |    |   |        0 |
| 0 |   |    |    |   | 73882.8 |   |    |    |   |        0 |
| 0 |   |    |    |   | 88588   |   | 22 |    |   |        0 |
| 0 |   |    |    |   | 106480  |   | 20 | 10 |   |        0 |
| 0 |   |    |    |   | 52500   |   |    |    |   |        0 |
| 0 |   | 20 | 10 |   | 22997.5 |   |    |    |   |        0 |
|   |   |    |    |   |         |   |    |    |   |        0 |
| 0 |   |    | 20 |   | 0       |   |    |    |   |        0 |
| 0 |   |    |    |   | 10520   |   |    |    |   |        0 |
+---+---+----+----+---+---------+---+----+----+---+----------+

My code is as follows:

columns1 = ['a', 'b', 'c', 'd', 'e']
columns2 = ['f', 'g', 'h', 'i', 'j']
data['discount'] = 0
for i, row in data.iterrows():
    a = 0
    b = 0
    # first value > 0 among columns1
    for col1 in columns1:
        value = row[col1]
        if value > 0:
            a = value
            break
    # first value > 0 among columns2
    for col2 in columns2:
        value = row[col2]
        if value > 0:
            b = value
            break
    # only set a discount when both groups contain a positive value
    if a != 0 and b != 0:
        data.loc[i, 'discount'] = abs(a - b)

Done this way, it takes a lot of time and a lot of memory on a large dataset. I have 700 MB of data; processing it takes more than 120 GB of RAM, and after approximately 10 hours the process fails with a MemoryError.

According to https://stackoverflow.com/a/24871316, I should not use iterrows like that. Please let me know how I can write this code more efficiently.

Please let me know the reason for downvoting my question, so that I can learn from it.

Recommended answer

Assuming your empty cells are NaN values, this gives you the first positive value in each row for the group of columns you are interested in, or NaN if the row has none (df here is the data frame called data in the question):

df[df>0][columns1].bfill(axis=1).iloc[:,0]

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5    20.0
6     NaN
7    20.0
8     NaN
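To see why the expression above picks the first positive value, here is a minimal sketch on a small hypothetical frame (the name toy and its values are made up for illustration, not taken from the question). It uses where, which masks the same cells as the boolean indexing df[df>0]:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; blank cells are represented as NaN.
toy = pd.DataFrame({
    "a": [0.0, 0.0, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": [np.nan, 20.0, np.nan],
    "d": [np.nan, 10.0, 5.0],
})

# Keep only strictly positive cells; zeros and NaN both become NaN.
positive = toy.where(toy > 0)

# Back-fill along each row, so the first column ends up holding
# the first positive value of that row (or NaN if there is none).
first_positive = positive.bfill(axis=1).iloc[:, 0]
print(first_positive)
```

Row 0 has no positive value and stays NaN, row 1 yields 20.0 (the zero in a is skipped), and row 2 yields 5.0.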

Thus, this will give you the abs(a-b) you're looking for:

res = (df[df>0][columns1].bfill(axis=1).iloc[:,0]
      -df[df>0][columns2].bfill(axis=1).iloc[:,0]).abs()
res

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5    22977.5
6        NaN
7        NaN
8        NaN

You can either combine it with your already initialized discount column:

res.combine_first(df.discount)

or fill the NaN values with 0:

res.fillna(0)
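Putting the pieces together, the following is a sketch of the whole computation on the sample rows from the question, assuming the blank cells are NaN and all columns are numeric. The helper name first_positive is introduced here for illustration and is not part of pandas:

```python
import numpy as np
import pandas as pd

columns1 = ["a", "b", "c", "d", "e"]
columns2 = ["f", "g", "h", "i", "j"]
nan = np.nan

# Sample rows from the question; blank cells are assumed to be NaN.
rows = [
    [0, nan, nan, nan, nan, 65497.6, nan, nan, nan, nan],
    [0, nan, nan, nan, nan, 73882.8, nan, nan, nan, nan],
    [0, nan, nan, nan, nan, 88588, nan, 22, nan, nan],
    [0, nan, nan, nan, nan, 106480, nan, 20, 10, nan],
    [0, nan, nan, nan, nan, 52500, nan, nan, nan, nan],
    [0, nan, 20, 10, nan, 22997.5, nan, nan, nan, nan],
    [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
    [0, nan, nan, 20, nan, 0, nan, nan, nan, nan],
    [0, nan, nan, nan, nan, 10520, nan, nan, nan, nan],
]
data = pd.DataFrame(rows, columns=columns1 + columns2, dtype=float)


def first_positive(frame):
    """Return the first value > 0 in each row, or NaN if there is none."""
    return frame.where(frame > 0).bfill(axis=1).iloc[:, 0]


# NaN in either group propagates through the subtraction, so rows that
# lack a positive value in one of the groups end up with a discount of 0.
data["discount"] = (
    (first_positive(data[columns1]) - first_positive(data[columns2]))
    .abs()
    .fillna(0)
)
print(data["discount"])
```

Only the row with c = 20 and f = 22997.5 gets a non-zero discount (22977.5), which matches the result of the original loop without iterating over rows in Python.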
