Does iterrows have performance issues?

Question

I have noticed very poor performance when using iterrows from pandas.

Is this something that is experienced by others? Is it specific to iterrows and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?

This discussion on GitHub led me to believe it is caused when mixing dtypes in the dataframe, however the simple example below shows it is there even when using one dtype (float64). This takes 36 seconds on my machine:

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print(end - start)

Why are vectorized operations like apply so much quicker? I imagine there must be some row by row iteration going on there too.

I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes. Thank you!

--- Edit: a simplified version of what I want to run has been added below ---

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

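#%% Define optimization (defined before the loop that calls it)
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2']*t1info)
    # Position of the largest product; t2info has a fresh 0..n-1 index
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow, ['letter', 'number2']]

#%% Create the body of the new table
# dtype=object so the frame can hold both strings and floats
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=[0], dtype=object)

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index(drop=True)
    # .ix has been removed from pandas; .loc is the label-based replacement
    table3.loc[row_index] = optimize(t2info, row['number1'])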


Recommended answer

Generally, iterrows should only be used in very very specific cases. This is the general order of precedence for performance of various operations:

1) vectorization
2) using a custom cython routine
3) apply
    a) reductions that can be performed in cython
    b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

Using a custom cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is ALWAYS ALWAYS the first and best choice. However, there is a small set of cases which cannot be vectorized in obvious ways (mostly involving a recurrence). Further, on a smallish frame, other methods may be faster.
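
As a rough sketch of what vectorizing can look like, the simplified example from the question could be rewritten with a merge and a groupby instead of nested iterrows loops. This assumes the goal is, for each letter, the table2 row that maximizes number2 * number1; the names merged, product, and best_idx are illustrative and do not appear in the question.

```python
import pandas as pd

table1 = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})
table2 = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                       'number2': [0.2, 0.5, 0.1, 0.4]})

# Bring number1 alongside every matching row of table2 (no Python-level loop).
merged = table2.merge(table1, on='letter')

# Compute all products in one vectorized operation.
merged['product'] = merged['number2'] * merged['number1']

# For each letter, find the row label that holds the largest product.
best_idx = merged.groupby('letter')['product'].idxmax()

# Select those rows and keep the same columns the loop version fills in.
table3 = merged.loc[best_idx, ['letter', 'number2']].reset_index(drop=True)
print(table3)
```

Whether this applies to the real data depends on whether the actual optimization step can be expressed as column arithmetic followed by a per-group reduction.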

3) apply can usually be handled by an iterator in Cython space (this is done internally in pandas); this is case a) above.

This is dependent on what is going on inside the apply expression, e.g. df.apply(lambda x: np.sum(x)) will be executed pretty swiftly (of course df.sum() is even better). However something like: df.apply(lambda x: x['b'] + 1, axis=1) will be executed in python space, and consequently is slower.
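
The two flavours of apply can be put side by side with their vectorized equivalents. This is a minimal sketch on a made-up three-row frame; it assumes the row-wise case is run with axis=1 so that x['b'] refers to column b, and it only checks that the results agree, not how fast each variant is.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# A reduction passed through apply: the lambda is called once per column.
col_sums_apply = df.apply(lambda x: np.sum(x))

# The built-in reduction gives the same numbers without going through apply.
col_sums = df.sum()

# Row-wise apply: the lambda is called once per row, in Python.
b_plus_one_apply = df.apply(lambda row: row['b'] + 1, axis=1)

# Vectorized equivalent: a single operation on the whole column.
b_plus_one = df['b'] + 1
```

On a frame with millions of rows, timing these variants (for example with time.time() as in the question) is the way to see the difference on your own data.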

4) itertuples does not box the data into a Series, just returns it as a tuple

5) iterrows DOES box the data into a Series. Unless you really need this, use another method.
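
To make the boxing difference in 4) and 5) concrete, here is a small sketch that computes the same total three ways. The frame and the variable names are made up for illustration; on a frame this small the timings are meaningless, so the point is only the shape of each loop.

```python
import pandas as pd

df = pd.DataFrame({'s1': [0.5, 1.5, 2.5], 's2': [1.0, 2.0, 3.0]})

# iterrows: every row is wrapped in a newly built Series.
total_iterrows = 0.0
for idx, row in df.iterrows():
    total_iterrows += row['s1'] * row['s2']

# itertuples: every row is a namedtuple, with fields read by attribute.
total_itertuples = 0.0
for row in df.itertuples(index=False):
    total_itertuples += row.s1 * row.s2

# Vectorized: no Python-level loop at all.
total_vectorized = (df['s1'] * df['s2']).sum()
```

If a loop is unavoidable, switching from iterrows to itertuples is usually a small change, since only the row access syntax differs.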

6) updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. Much better to create new structures and concat.
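
A minimal sketch of the pattern in 6) next to two alternatives that build the result in one go. The column names n and square are placeholders, and collecting plain records in a list is one common way to "create new structures", not the only one.

```python
import pandas as pd

# Slow pattern: grow a frame one row at a time with loc.
slow = pd.DataFrame(columns=['n', 'square'])
for n in range(5):
    slow.loc[n] = [n, n * n]

# Usually better: collect plain Python records, then build the frame once.
records = [{'n': n, 'square': n * n} for n in range(5)]
fast = pd.DataFrame(records)

# When the pieces are already DataFrames, concatenate them once at the end.
pieces = [pd.DataFrame({'n': [n], 'square': [n * n]}) for n in range(5)]
combined = pd.concat(pieces, ignore_index=True)
```

All three produce the same values; the difference is that the first pays the indexing checks on every row, while the other two pay the construction cost once.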
