Iterrows performance
Question
I'm working with Python 2.7 and pandas (version 0.18.1) data frames. I have to modify one column of a data frame based on several other columns of the same data frame.
For that I wrote the code shown further down, where data is my data frame. My sample data looks like this:
+---+---+----+----+---+---------+---+----+----+---+----------+
| a | b | c | d | e | f | g | h | i | j | discount |
+---+---+----+----+---+---------+---+----+----+---+----------+
| 0 | | | | | 65497.6 | | | | | 0 |
| 0 | | | | | 73882.8 | | | | | 0 |
| 0 | | | | | 88588 | | 22 | | | 0 |
| 0 | | | | | 106480 | | 20 | 10 | | 0 |
| 0 | | | | | 52500 | | | | | 0 |
| 0 | | 20 | 10 | | 22997.5 | | | | | 0 |
| | | | | | | | | | | 0 |
| 0 | | | 20 | | 0 | | | | | 0 |
| 0 | | | | | 10520 | | | | | 0 |
+---+---+----+----+---+---------+---+----+----+---+----------+
My code is as follows:
columns1 = ['a', 'b', 'c', 'd', 'e']
columns2 = ['f', 'g', 'h', 'i', 'j']
data['discount'] = 0
for i, row in data.iterrows():
    a = 0
    b = 0
    # first positive value among columns1
    for col1 in columns1:
        value = row[col1]
        if value > 0:
            a = value
            break
    # first positive value among columns2
    for col2 in columns2:
        value = row[col2]
        if value > 0:
            b = value
            break
    # set the discount only when both groups contain a positive value
    if a != 0 and b != 0:
        data.loc[i, 'discount'] = abs(a - b)
Done this way, it takes a lot of time and a lot of memory on a large dataset. I have 700 MB of data; processing it takes more than 120 GB of RAM, and after approximately 10 hours the process fails with a MemoryError.
According to https://stackoverflow.com/a/24871316, I should not use iterrows like this. Please let me know how I can write this code more efficiently.
Please tell me the reason for downvoting my question, so that I can learn.
Recommended answer
Assuming your empty cells are NaN values, this gives you the first non-NA value of each row for the group of columns you are interested in (values that are not greater than 0 are masked to NaN first):
df[df>0][columns1].bfill(axis=1).iloc[:,0]
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 20.0
6 NaN
7 20.0
8 NaN
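To see why this expression picks the first positive value per row, here is a minimal sketch on a made-up two-column frame (the name toy and its values are illustrative assumptions, not part of the original data). Masking with toy > 0 turns zeros and other non-positive values into NaN, bfill(axis=1) pulls the next valid value leftwards along each row, and iloc[:, 0] then reads the leftmost column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only to illustrate the mask + bfill idea.
toy = pd.DataFrame({
    "c": [np.nan, 20.0, np.nan, 0.0],
    "d": [np.nan, 10.0, 20.0, 5.0],
})

# Keep values > 0, replace everything else (including NaN) with NaN.
# toy.where(toy > 0) behaves like toy[toy > 0].
positive = toy.where(toy > 0)

# Back-fill along each row, then take the first column:
# that is the first positive value of the row, or NaN if there is none.
first_positive = positive.bfill(axis=1).iloc[:, 0]
print(first_positive)
```

A row with no positive value stays NaN, which is what later allows the difference to be NaN whenever either group has no positive value.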
Thus, this gives you the abs(a-b) you're searching for:
res = (df[df>0][columns1].bfill(axis=1).iloc[:,0]
-df[df>0][columns2].bfill(axis=1).iloc[:,0]).abs()
res
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 22977.5
6 NaN
7 NaN
8 NaN
You can either combine it with your initialized discount column:
res.combine_first(df.discount)
or fill the missing values with 0:
res.fillna(0)
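Putting the pieces together, the following is a sketch of the whole approach on the sample data from the question, assuming the empty cells are NaN and all ten columns are numeric. The helper names first_positive, discount_vectorized and discount_loop are introduced here for illustration only; discount_loop re-implements the question's row-by-row logic so the two results can be compared.

```python
import numpy as np
import pandas as pd

columns1 = ["a", "b", "c", "d", "e"]
columns2 = ["f", "g", "h", "i", "j"]

nan = np.nan
rows = [
    # a, b,   c,   d,   e,   f,        g,   h,   i,   j
    [0, nan, nan, nan, nan, 65497.6, nan, nan, nan, nan],
    [0, nan, nan, nan, nan, 73882.8, nan, nan, nan, nan],
    [0, nan, nan, nan, nan, 88588.0, nan, 22, nan, nan],
    [0, nan, nan, nan, nan, 106480.0, nan, 20, 10, nan],
    [0, nan, nan, nan, nan, 52500.0, nan, nan, nan, nan],
    [0, nan, 20, 10, nan, 22997.5, nan, nan, nan, nan],
    [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
    [0, nan, nan, 20, nan, 0.0, nan, nan, nan, nan],
    [0, nan, nan, nan, nan, 10520.0, nan, nan, nan, nan],
]
df = pd.DataFrame(rows, columns=columns1 + columns2)


def first_positive(frame, columns):
    """First value > 0 in each row of the given columns, NaN if none."""
    sub = frame[columns]  # select first, so only these columns are masked
    return sub.where(sub > 0).bfill(axis=1).iloc[:, 0]


def discount_vectorized(frame):
    """abs(a - b) where both groups have a positive value, else 0."""
    diff = first_positive(frame, columns1) - first_positive(frame, columns2)
    return diff.abs().fillna(0)


def discount_loop(frame):
    """Row-by-row version mirroring the question's logic, for comparison."""
    out = pd.Series(0.0, index=frame.index)
    for i, row in frame.iterrows():
        a = next((v for v in row[columns1] if v > 0), 0)
        b = next((v for v in row[columns2] if v > 0), 0)
        if a != 0 and b != 0:
            out.loc[i] = abs(a - b)
    return out


df["discount"] = discount_vectorized(df)
print(df["discount"])
```

Selecting the columns before masking, instead of masking the whole frame as in df[df>0][columns1], is a small variation that should avoid building a masked copy of every column; whether that matters at the 700 MB scale mentioned in the question has not been measured here.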