When should I ever want to use pandas apply() in my code?
Question
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
- If apply is so bad, then why is it in the API?
- How and when should I make my code apply-free?
- Are there ever any situations where apply is good (better than other possible solutions)?
Answer
apply, the Convenience Function You Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
- Run any user-defined function on a DataFrame or Series
- Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
- Perform index alignment while applying the function
- Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
- Perform element-wise transformations
- Broadcast aggregated results to original rows (see the result_type argument)
- Accept positional/keyword arguments to pass to the user-defined functions
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
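As a small illustration of a few of these capabilities (toy data and hypothetical UDFs, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# Column-wise (axis=0, the default): the UDF receives each column as a Series
col_range = df.apply(lambda s: s.max() - s.min())

# Row-wise (axis=1), with an extra positional argument forwarded via args=
row_sum = df.apply(lambda row, offset: row.sum() + offset, axis=1, args=(100,))

print(col_range.tolist())  # [1, 10]
print(row_sum.tolist())    # [111, 122]
```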
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance-wise, there's no comparison; the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, look for vectorized alternatives where possible.
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df
     Name                       Title  Value
0  mickey                  wonderland     20
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86
This should return the second and third rows, since "donald" and "minnie" appear in their respective "Title" columns.
Using apply, this would be done with:
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
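For instance, a NaN-tolerant variant of the list comprehension might look like this (the helper name and the try/except guard are my own, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['mickey', 'donald', None],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse']})

def name_in_title(title, name):
    # Guard against NaN/None and other non-string values before comparing.
    try:
        return name.lower() in title.lower()
    except AttributeError:
        return False

mask = [name_in_title(x, y) for x, y in zip(df['Title'], df['Name'])]
print(mask)  # [False, True, False]
```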
For more information on when list comprehensions should be considered a good option, see my writeup: For loops with pandas - When should I care?.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime).
Read more at the docs.
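A quick sanity check of the note above (toy data of my own) shows that the two forms produce the same result, with the vectorized form doing it in a single call:

```python
import pandas as pd

s = pd.Series(['2019-01-01', '2019-01-02', '2019-01-03'])

vec = pd.to_datetime(s)          # vectorized: one call over the whole Series
slow = s.apply(pd.to_datetime)   # element-wise: one call per value

assert (vec == slow).all()
```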
A Common Pitfall: Exploding a Column of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Lastly,
"Are there any situations where apply is good?"
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
    pd.date_range('2018-12-31', '2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
    columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))
v/s
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
And so on...
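As a concrete (and entirely hypothetical, using pattern 'rr') instance of the string-operation comparison sketched above, both forms produce the same DataFrame of booleans:

```python
import pandas as pd

df = pd.DataFrame({'x': ['apple', 'berry'], 'y': ['cat', 'banana']})

# apply version: dispatches the Series-vectorized .str.contains per column
u1 = df.apply(lambda col: col.str.contains('rr'))

# concat version: explicit loop over the columns, same result
u2 = pd.concat([df[c].str.contains('rr') for c in df], axis=1)

assert u1.equals(u2)
```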
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.
The graph was plotted using the perfplot library.
import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function that handles anything the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then two chained operations, such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast path for evaluating the result; otherwise, it falls back to a slow implementation.
df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)
# 1
# 1
# 2
   A  B
0  1  x
1  2  y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed in 0.25; see here for more information).