Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series


Problem description

There are many questions (1, 2, 3) dealing with counting values in a single series.

However, there are fewer questions looking at the best way to count combinations of two or more series. Solutions are presented (1, 2), but when and why one should use each is not discussed.

Below is some benchmarking for three potential methods. I have two specific questions:

  1. Why is grouper more efficient than count? I expected count to be more efficient, as it is implemented in C. The superior performance of grouper persists even if the number of columns is increased from 2 to 4.
  2. Why does value_counter underperform grouper by so much? Is this due to the cost of constructing a list, or of constructing a series from a list?

I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy arrays versus a dictionary comprehension:

x, z = grouper(df), count(df)
%timeit x[x.values > 10]                        # 749µs
%timeit {k: v for k, v in z.items() if v > 10}  # 9.37ms

However, the focus of my question is on the performance of building comparable results in a series versus a dictionary. My C knowledge is limited, yet I would appreciate any answer which can point to the logic underlying these methods.

Benchmarking code

import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(0)

m, n = 1000, 100000

df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)

def count(df):
    return Counter(zip(df.A.values, df.B.values))

x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)

assert (x == y) & (y == z), "Dictionary mismatch!"

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)

Benchmarking results

Run on python 3.6.2, pandas 0.20.3, numpy 1.13.1

Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.

Key: g = grouper, v = value_counter, c = count.

m           n        g        v       c
100     10000     2.91    18.30    8.41
1000    10000     4.10    27.20    6.98[1]
100    100000    17.90   130.00   84.50
1000   100000    43.90   309.00   93.50

[1] This is not a typo.

Recommended answer

There's actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to numpy arrays being stored in memory in a fundamentally different way than Python objects.

A numpy array, such as np.arange(10), is essentially stored as a contiguous block of memory, and not as individual Python objects. Conversely, a Python list, such as list(range(10)), is stored in memory as pointers to individual Python objects (i.e. integers 0-9). This difference is the basis for why numpy arrays are smaller in memory than the Python equivalent lists, and why you can perform faster computations on numpy arrays.
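
As a rough, illustrative sketch of that difference (the exact byte counts are platform- and version-dependent; this snippet is my own addition, not part of the original answer), you can compare the memory footprint of a small numpy array with that of the equivalent Python list:

import sys
import numpy as np

arr = np.arange(10)      # one contiguous buffer holding 10 integers
lst = list(range(10))    # an array of pointers to 10 separate Python int objects

print(arr.nbytes)                                   # raw data buffer only
print(sys.getsizeof(lst)
      + sum(sys.getsizeof(i) for i in lst))         # list header + per-object overhead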

So, as Counter is consuming the zip, the associated tuples need to be created as Python objects. This means that Python needs to extract the tuple values from numpy data and create corresponding Python objects in memory. There is noticeable overhead to this, which is why you want to be very careful when combining pure Python functions with numpy data. A basic example of this pitfall that you might commonly see is using the built-in Python sum on a numpy array: sum(np.arange(10**5)) is actually a bit slower than the pure Python sum(range(10**5)), and both are of course significantly slower than np.sum(np.arange(10**5)).
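
If you want to see that pitfall for yourself, a minimal IPython sketch of the comparison just described (assuming numpy is imported as np, as in the benchmarks above; exact timings will vary by machine) is:

%timeit sum(np.arange(10**5))     # built-in sum: every element is boxed into a Python int
%timeit sum(range(10**5))         # pure Python end to end, no numpy-to-object conversion
%timeit np.sum(np.arange(10**5))  # stays in C over contiguous memory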

See this video for a more in-depth discussion of this topic.

As an example specific to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.

In [2]: a = np.random.randint(10**4, size=10**6)
   ...: b = np.random.randint(10**4, size=10**6)
   ...: a_list = a.tolist()
   ...: b_list = b.tolist()

In [3]: %timeit Counter(zip(a, b))
455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit Counter(zip(a_list, b_list))
334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference between these two timings gives you a reasonable estimate of the overhead discussed earlier.

This isn't quite the end of the story though. Constructing a groupby object in pandas involves some overhead too, at least as related to this problem, since there's some groupby metadata that isn't strictly necessary just to get size, whereas Counter does the one singular thing you care about. Usually this overhead is far less than the overhead associated with Counter, but from some quick experimentation I've found that you can actually get marginally better performance from Counter when the majority of your groups consist of single elements.

Consider the following timings (using @BallpointBen's sort=False suggestion) that go along the spectrum of few large groups <--> many small groups:

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def count(df):
    return Counter(zip(df.A.values, df.B.values))

for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit count(df)

This gives me the following table:

m       grouper   counter
10      62.9 ms    315 ms
10**3    191 ms    535 ms
10**7    514 ms    459 ms

Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.
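
As a minimal sketch of what that conversion might look like (my own illustration, not part of the original answer; the index names are assumed to match the benchmark columns), the Counter's tuple keys can be rebuilt into a MultiIndex:

c = count(df)  # Counter keyed by (A, B) tuples
s = pd.Series(list(c.values()),
              index=pd.MultiIndex.from_tuples(list(c.keys()), names=['A', 'B']))
# s is now comparable to grouper(df); the cost of this construction is the
# offsetting overhead mentioned above.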
