What is a good way to avoid using iterrows in this example?

Problem description

I discussed the performance issues of iterrows previously, and was given good general responses. This question is a specific case where I'd like your help in applying something better, as iterrows is SLOW.

I believe the question can be useful for any new python/pandas programmers who feel stuck thinking with a row iteration mentality.

The examples I've seen using 'map' or 'apply' generally show one datatable which seems intuitive enough. However, I am working across two tables and they are large (T1 is 2.5million rows, T2 is 96000 rows).

Here is a simple example (it works in my session):

import pandas as pd

# Create the original tables
t1 = {'letter': ['a', 'b'],
      'number1': [50, -10]}

t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [0.2, 0.5, 0.1, 0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Define the optimization (it must be defined before the loop that calls it)
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow]

# Create the body of the new table, one row per row of table1
table3 = pd.DataFrame(index=table1.index, columns=['letter', 'number2'])

# Iterate through, filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index(drop=True)
    table3.loc[row_index] = optimize(t2info, row['number1'])

print(table3)

The output is:

  letter number2
0      a     0.5
1      b     0.1

[2 rows x 2 columns]

The overall idea:

  1. Producing table3 is the goal - it has the same dimensions as table1
  2. Fill table3 with the "optimal" row from table2, based on the corresponding input from table1
  3. The data used from table2 is a subset, based on 'letter' in table1
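The three steps above can be sketched without any row loop at all. This is a minimal sketch (not part of the original question), assuming "optimal" means maximizing `number1 * number2` within each letter, as the `optimize` function above does:

```python
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

# Step 3: the per-letter subset of table2 is expressed implicitly by a merge
merged = table1.merge(table2, on='letter')

# Step 2: the "optimal" row maximizes number1 * number2 within each letter
merged['product'] = merged['number1'] * merged['number2']
best_idx = merged.groupby('letter')['product'].idxmax()

# Step 1: table3 has one row per row of table1
table3 = merged.loc[best_idx, ['letter', 'number2']].reset_index(drop=True)
print(table3)
```

Every operation here is vectorized, so it scales to millions of rows far better than `iterrows`.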

(Obviously this case is not slow because it is tiny, but when working with millions of rows it is. Bear in mind that in the real example I have more columns in both tables.)

Recommended answer

To me, looks like the easiest thing is to merge on letter and then groupby.

import pandas as pd
import numpy as np

# Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

table3 = table1.merge(table2,on='letter')

grouped = table3.groupby('letter')

def get_optimization(df):
    product_column = df.number1 * df.number2
    idx_of_prod_col_max = product_column.idxmax()
    return_val = df.loc[idx_of_prod_col_max]['number2']
    return return_val

table3 = grouped.apply(get_optimization)
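Run end to end, the merge-and-groupby approach produces a Series indexed by letter rather than a two-column DataFrame. A self-contained sketch (with the `product_column` name spelled consistently) showing the full pipeline and how to recover the question's `table3` shape:

```python
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

merged = table1.merge(table2, on='letter')

def get_optimization(df):
    # pick number2 from the row that maximizes number1 * number2
    product_column = df['number1'] * df['number2']
    return df.loc[product_column.idxmax(), 'number2']

# One scalar per group -> a Series indexed by letter
result = merged.groupby('letter').apply(get_optimization)

# To recover the question's two-column table3, reset the index
table3 = result.rename('number2').reset_index()
print(table3)
```

Note that the merge materializes one row per matching pair, so with 2.5 million and 96,000 rows the merged frame can be large; it is still typically much faster than `iterrows` because the work stays vectorized.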
