Are for-loops in pandas really bad? When should I care?


Question


Are for loops really "bad"? If not, in what situation(s) would they be better than using a more conventional "vectorized" approach? [1]

I am familiar with the concept of "vectorization", and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.

However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are "bad", and that one should "never" iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?


[1] - While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.

Solution

TLDR; No, for loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (...depending on what you're doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let's examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function to ensure that it works:

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).
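As a rough illustration of that fixed overhead, here is a minimal sketch (the array size and the commented IPython %timeit calls are illustrative assumptions; exact numbers vary by machine and pandas version):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(4))

# On tiny inputs, most of the pandas runtime is bookkeeping
# (alignment, dtype checks, dispatch), not arithmetic.
# %timeit s + s                  # pandas op, with the overhead
# %timeit s.values + s.values    # raw NumPy, arithmetic only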

for loops, on the other hand, are faster than you think. What's even better is that list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.
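For instance, a hypothetical elementwise maximum over two numeric columns (the column names here are assumptions for illustration) would be written as:

# Elementwise max of two columns via a list comprehension.
df['C'] = [max(x, y) for x, y in zip(df['A'], df['B'])]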

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timing plots are not reproduced here; they can be regenerated with the snippets in the Appendix.

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, it will break. In some cases, vectorized operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorization without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

This outperforms both the pandas and list comprehension equivalents.

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example - this time, with another vanilla Python construct that is faster than a for loop - collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced here: Counter wins out over both vectorized methods for a larger range of small N (up to ~3500).

Note
More trivia (courtesy @user2357112). Counter is implemented with a C accelerator, so while it still has to work with Python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the takeaway here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there are always Cython and Numba. Let's add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    # Preallocate a NumPy boolean array; reflected Python lists are
    # deprecated inside numba-compiled functions and do not parallelize reliably.
    result = np.empty(len(x), dtype=np.bool_)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return result

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
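One practical usage note (a sketch, with arbitrarily chosen array sizes): the first call pays the JIT compilation cost, so benchmark from the second call onwards.

a = np.random.choice(1000, 10_000)
b = np.random.choice(1000, 10_000)

get_mask(a, b)              # first call triggers compilation
# %timeit get_mask(a, b)    # later calls run at compiled speed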


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
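A quick way to see this, assuming the classic object-backed string storage (rather than the newer dedicated string dtype), is to inspect the dtypes:

df2 = df.astype(str)
df2.dtypes
# A    object
# B    object
# dtype: object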

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehensions outperform all of the pandas operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension

Positional List Indexing
Timings for three approaches that extract the 0th element from a column of lists (handling exceptions): map, the str accessor, and list comprehensions:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

when reconstructing the series.
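For example, a hypothetical index-preserving version of the safe list comprehension above:

# Same "list comp safe" kernel, but keeping the original index.
pd.Series([get_0th(x) for x in ser], index=ser.index)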

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure Python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

Both itertools.chain.from_iterable and the nested list comprehension are pure Python constructs, and scale much better than the stack solution.

These timings are a strong indication that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
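As a hedged sketch of what that normalization might look like (reusing the column of dicts from the earlier example, so the resulting 'key'/'value' columns are assumptions from that setup):

# Expand a column of dicts into flat, scalar-valued columns.
flat = pd.DataFrame(ser.tolist())   # columns: 'key', 'value'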

Lastly, the applicability of these solutions depends heavily on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it's that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python's re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
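For reference, the pandas counterpart that this replaces (handling NaNs via the na argument; the pattern string is a stand-in) is:

ser[ser.str.contains(r'pattern', na=False)]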

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.
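For example, a hypothetical variant that returns every capture group as a tuple:

def matcher_groups(x):
    m = p.search(str(x))
    # Return all capture groups, or NaN on no match.
    return m.groups() if m else np.nan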

For str.extractall, change p.search to p.findall.
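In other words, a hypothetical list comp analogue of str.extractall for a single capture group (column names assumed for illustration, and assuming no NaNs) looks like:

df['matches'] = [p.findall(x) for x in df['col']]   # all non-overlapping matches per string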

String Extraction
Consider a simple extraction operation. The idea is to extract 4 digits if they are preceded by an uppercase letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

More Examples
Full disclosure - I am the author (in part or whole) of these posts listed below.


Conclusion

As shown in the examples above, iteration shines when working with DataFrames that have a small number of rows, with mixed datatypes, and with regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

As another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms Python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)
