Efficiently processing DataFrame rows with a Python function?
Problem description
In many places in our Pandas-using code, we have some Python function process(row). That function is used over DataFrame.iterrows(), taking each row, doing some processing, and returning a value, which we ultimately collect into a new Series.
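Concretely, the pattern looks something like this (a minimal sketch; process here is a stand-in for our actual per-row functions):

```python
import numpy as np
import pandas as pd

def process(row):
    # stand-in for arbitrary per-row Python logic
    return row['a'] + row['b']

df = pd.DataFrame({'a': np.arange(3), 'b': np.random.rand(3)})

# the usage pattern in question: iterate rows, collect results into a new Series
results = pd.Series([process(row) for _, row in df.iterrows()])
```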
I realize this usage pattern circumvents most of the performance benefits of the numpy / Pandas stack.
- What would be the best way to make this usage pattern as efficient as possible?
- Can we possibly do it without rewriting most of our code?
Another aspect of this question: can all such functions be converted to a numpy-efficient representation? I've much to learn about the numpy / scipy / Pandas stack, but it seems that for truly arbitrary logic, you may sometimes need to just use a slow pure Python architecture like the one above. Is that the case?
Answer
You should apply your function along axis=1. The function will receive a row as an argument, and anything it returns will be collected into a new Series object:
df.apply(your_function, axis=1)
Example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.random.rand(3)})
>>> df
   a         b
0  0  0.880075
1  1  0.143038
2  2  0.795188
>>> def func(row):
...     return row['a'] + row['b']
...
>>> df.apply(func, axis=1)
0    0.880075
1    1.143038
2    2.795188
dtype: float64
As for the second part of the question: row-wise operations, even optimised ones using pandas apply, are not the fastest solution there is. They are certainly a lot faster than a Python for loop, but not the fastest. You can test that by timing the operations, and you'll see the difference.
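A rough way to see this for yourself is a timing sketch with timeit (the function and data here are made up for illustration; absolute numbers will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5_000), 'b': np.random.rand(5_000)})

def func(row):
    return row['a'] + row['b']

# each approach computes the same values; only the mechanism differs
t_iterrows = timeit.timeit(
    lambda: pd.Series([func(row) for _, row in df.iterrows()]), number=3)
t_apply = timeit.timeit(lambda: df.apply(func, axis=1), number=3)
t_vector = timeit.timeit(lambda: df['a'] + df['b'], number=3)

print(f"iterrows: {t_iterrows:.4f}s  apply: {t_apply:.4f}s  vectorized: {t_vector:.4f}s")
```

On typical setups the vectorized column operation wins by a large margin over both row-wise variants.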
Some operations can be converted to column-oriented ones (the one in my example could easily be converted to just df['a'] + df['b']), but others cannot, especially if you have a lot of branching, special cases, or other logic that should be performed on your rows. In that case, if apply is too slow for you, I would suggest "Cython-izing" your code. Cython plays really nicely with the NumPy C API and will give you the maximal speed you can achieve.
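That said, more branching fits the column-oriented style than you might expect: np.select evaluates an ordered list of conditions over whole columns and takes the first match, mirroring an if/elif chain. A sketch with a hypothetical branchy process:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(6), 'b': np.random.rand(6)})

def process(row):
    # hypothetical branchy per-row logic
    if row['a'] % 2 == 0:
        return row['b'] * 2
    elif row['a'] > 3:
        return row['a'] - row['b']
    return 0.0

row_wise = df.apply(process, axis=1)

# same logic, column-oriented: np.select picks the first matching condition,
# just like the if/elif chain above
conditions = [df['a'] % 2 == 0, df['a'] > 3]
choices = [df['b'] * 2, df['a'] - df['b']]
vectorized = pd.Series(np.select(conditions, choices, default=0.0))
```

Only when the logic genuinely cannot be decomposed this way does dropping down to Cython pay off.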
Or you can try numba. :)