When should I ever want to use pandas apply() in my code?

Question

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".

I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:

  1. If apply is so bad, then why is it in the API?
  2. How and when should I make my code apply-free?
  3. Are there ever any situations where apply is good (better than other possible solutions)?

Answer

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

"If apply is so bad, then why is it in the API?"

DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument).
  • Accept positional/keyword arguments to pass to the user-defined functions.
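To make the list concrete, here is a minimal sketch (toy data; the function names are illustrative) exercising a few of these features:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Column-wise (axis=0, the default): the function receives each column as a Series.
col_range = df.apply(lambda s: s.max() - s.min())

# Row-wise (axis=1): the function receives each row as a Series.
row_sum = df.apply(lambda row: row["A"] + row["B"], axis=1)

# Positional/keyword arguments are forwarded to the user-defined function.
def scale(s, factor=1):
    return s * factor

scaled = df.apply(scale, factor=10)
```

Every one of these has a faster vectorized spelling (`df.max() - df.min()`, `df.sum(axis=1)`, `df * 10`); the sketch only shows the mechanics of apply itself.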

...Among others. For more information, see Row or Column-wise Function Application in the documentation.

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.

There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.

Let's address the next question.

"How and when should I make my code apply-free?"

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply for a simple addition operation.

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance-wise, there's no comparison; the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it's still twice as slow.

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, look for vectorized alternatives wherever possible.

Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
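Before reaching for apply on text columns, it is worth checking the .str accessor first. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["Mickey Mouse", "donald duck", "Minnie"])

# Vectorized string methods live under the .str accessor...
lower_vec = s.str.lower()
has_mouse = s.str.contains("mouse", case=False)

# ...and are preferred over the equivalent apply call:
lower_apply = s.apply(str.lower)
```

Note that "vectorized" is in quotes for a reason: the .str methods still loop in Python under the hood, but they handle NaN and dtype edge cases for you.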

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.

Using apply, this would be done as follows:

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool

df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
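As a sketch of what such a custom function might look like (the helper name and the None value below are illustrative, not from the original):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["mickey", None, "minnie"],
    "Title": ["wonderland", "welcome to donald's castle", "Minnie mouse clubhouse"]})

def name_in_title(name, title):
    # Guard against NaN/None and non-string values before comparing.
    if not isinstance(name, str) or not isinstance(title, str):
        return False
    return name.lower() in title.lower()

mask = [name_in_title(n, t) for n, t in zip(df["Name"], df["Title"])]
```

The comprehension stays fast, and all the messy type handling is isolated in one small, testable function.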

For more information on when list comprehensions should be considered a good option, see my writeup: For loops with pandas - When should I care?.

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']), over, say, df['date'].apply(pd.to_datetime).

Read more at the docs.
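For instance (a minimal sketch with made-up date strings), both spellings produce the same result, but the vectorized call parses the whole Series at once while apply invokes the parser once per element:

```python
import pandas as pd

s = pd.Series(["2019-01-01", "2019-06-15", "2019-12-31"])

# Vectorized: one call parses the entire Series.
vec = pd.to_datetime(s)

# apply calls pd.to_datetime once per scalar -- same result, far more overhead.
looped = s.apply(pd.to_datetime)
```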

A Common Pitfall: Exploding a Column of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Lastly,

"Are there any situations where apply is good?"

Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object

This is an admissible case for apply:

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))

v/s

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')

And so on...

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.

The graph was plotted using the perfplot library.

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the data in the test being of integer type.

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy and then two prime operations such as a "lagged cumsum":

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

You'd need two successive groupby calls here:

df.groupby('A').B.cumsum().groupby(df.A).shift()

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a single call.

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
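As a sanity check, the two spellings agree on the example data. The comparison below is on values only, since on recent pandas versions GroupBy.apply may attach the group key to the result's index:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabcccddee"),
                   "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})

# Two groupby calls vs. one groupby + apply; both compute a lagged cumsum per group.
two_calls = df.groupby("A").B.cumsum().groupby(df.A).shift()
one_call = df.groupby("A").B.apply(lambda x: x.cumsum().shift())

# Compare values; fill NaN first so the comparison is well-defined.
same = list(two_calls.fillna(-1)) == list(one_call.fillna(-1))
```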

Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)
