What is a good way to avoid using iterrows in this example?
Problem description
I discussed the performance issues of iterrows previously, and was given good general responses. This question is a specific case where I'd like your help in applying something better, as iterrows is SLOW.
I believe the question can be useful for any new python/pandas programmers who feel stuck thinking with a row iteration mentality.
The examples I've seen using 'map' or 'apply' generally show one datatable which seems intuitive enough. However, I am working across two tables and they are large (T1 is 2.5million rows, T2 is 96000 rows).
Here is a simple example (it works in my session):
import pandas as pd

# Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Define optimization: return the row of t2info that maximizes number2 * t1info
# (defined before the loop so it exists when the loop calls it)
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    # .loc replaces .ix, which was removed in pandas 1.0
    return t2info.loc[maxrow]

# Create the body of the new table
table3 = pd.DataFrame(columns=['letter', 'number2'])

# Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index(drop=True)
    table3.loc[row_index] = optimize(t2info, row['number1'])

print(table3)
The output is:
  letter  number2
0      a      0.5
1      b      0.1

[2 rows x 2 columns]
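General thinking:
- Producing table3 is the goal; it has the same dimensions as table1.
- Populate table3 with the 'optimal' row from table2, based on the corresponding inputs from table1.
- The data used from table2 is a subset based on the 'letter' from table1.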
(Obviously this case is not slow because it is tiny, but when working with millions of rows it is. Bear in mind that in the real example I have more columns in both tables.)
Recommended answer
To me, it looks like the easiest thing is to merge on letter and then groupby.
import pandas as pd

# Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Merge on letter so each table1 row is paired with its matching table2 rows
table3 = table1.merge(table2, on='letter')
grouped = table3.groupby('letter')

def get_optimization(df):
    # Return number2 from the row that maximizes number1 * number2
    product_column = df.number1 * df.number2
    idx_of_prod_col_max = product_column.idxmax()
    # .loc replaces .ix, which was removed in pandas 1.0
    return_val = df.loc[idx_of_prod_col_max, 'number2']
    return return_val

# Result is a Series of the best number2, indexed by letter
table3 = grouped.apply(get_optimization)
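
If the per-group apply is still too slow on millions of rows, the same merge-then-groupby idea can probably be written without a Python-level function per group. The following is only a sketch under assumptions: the column name 'product' and the variable names merged and best_idx are chosen here for illustration, and it assumes each letter appears once in table1, as in the example. If letters repeat in table1, the grouping key would need to be table1's row identifier instead.

```python
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

# Pair every table1 row with the table2 rows that share its letter
merged = table1.merge(table2, on='letter')

# Objective from the question: number1 * number2
# ('product' is an illustrative column name, not from the original post)
merged['product'] = merged['number1'] * merged['number2']

# Index label of the row with the largest product within each letter group
best_idx = merged.groupby('letter')['product'].idxmax()

# Keep only the winning rows and the columns the question asks for
table3 = merged.loc[best_idx, ['letter', 'number2']].reset_index(drop=True)
print(table3)
```

Unlike the apply version, which returns a Series indexed by letter, this returns a DataFrame with letter and number2 columns, matching the layout of the table3 in the question.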