Efficiently processing DataFrame rows with a Python function?


Question

In many places in our Pandas-using code, we have some Python function process(row). That function is used over DataFrame.iterrows(): it takes each row, does some processing, and returns a value, which we ultimately collect into a new Series.

I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.

  1. What would be the best way to make this usage pattern as efficient as possible?
  2. Can we possibly do it without rewriting most of our code?

Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes just need to use a slow pure-Python architecture like the one above. Is that the case?

Recommended answer

You should apply your function along axis=1. The function will receive a row as an argument, and anything it returns will be collected into a new Series object:

df.apply(your_function, axis=1)

Example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.random.rand(3)})
>>> df
   a         b
0  0  0.880075
1  1  0.143038
2  2  0.795188
>>> def func(row):
...     return row['a'] + row['b']
...
>>> df.apply(func, axis=1)
0    0.880075
1    1.143038
2    2.795188
dtype: float64

As for the second part of the question: row-wise operations using pandas apply, even optimised ones, are not the fastest solution there is. They are usually faster than a Python for loop over iterrows(), but apply with axis=1 still calls the function once per row, so it is not the fastest option. You can test that by timing the operations and you'll see the difference.
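One way to run such a timing comparison is sketched below. The DataFrame size, the row function, and the helper names (with_iterrows, with_apply, vectorized) are assumptions made up for illustration. Absolute timings depend on your machine and pandas version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data and row function, for illustration only.
df = pd.DataFrame({"a": np.arange(1000), "b": np.random.rand(1000)})


def func(row):
    return row["a"] + row["b"]


def with_iterrows():
    # The pattern from the question: loop over rows, collect into a Series.
    return pd.Series([func(row) for _, row in df.iterrows()], index=df.index)


def with_apply():
    # Row-wise apply: func is still called once per row.
    return df.apply(func, axis=1)


def vectorized():
    # Column-oriented equivalent of func.
    return df["a"] + df["b"]


for fn in (with_iterrows, with_apply, vectorized):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds:.4f} s for 5 runs")
```

All three variants should produce the same values, so the comparison is only about speed.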

Some operations can be converted to column-oriented ones (the one in my example could easily be converted to just df['a'] + df['b']), but others cannot, especially if you have a lot of branching, special cases, or other logic that should be performed on each row. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.
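Before reaching for Cython, it may be worth checking whether the branching is simple enough to express with numpy.where. The sketch below is only an illustration: the branching rule in func is hypothetical, and heavier per-row logic may not translate this cleanly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.5, -1.0, 2.0, 0.0]})


def func(row):
    # Hypothetical branching rule, written row by row.
    if row["b"] < 0:
        return 0.0
    elif row["a"] % 2 == 0:
        return row["a"] + row["b"]
    else:
        return row["a"] * row["b"]


row_wise = df.apply(func, axis=1)

# The same rule expressed on whole columns with nested np.where calls.
column_wise = pd.Series(
    np.where(
        df["b"] < 0,
        0.0,
        np.where(df["a"] % 2 == 0, df["a"] + df["b"], df["a"] * df["b"]),
    ),
    index=df.index,
)

print(row_wise.equals(column_wise))
```

Note that np.where evaluates both branches for every row before selecting, so this only works when each branch is safe to compute on all rows.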

Or you can try numba. :)
