Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series


Problem description


There are many questions (1, 2, 3) dealing with counting values in a single series.

However, there are fewer questions looking at the best way to count combinations of two or more series. Solutions are presented (1, 2), but when and why one should use each is not discussed.

Below is some benchmarking for three potential methods. I have two specific questions:

  1. Why is grouper more efficient than count? I expected count to be the more efficient of the two, as it is implemented in C. The superior performance of grouper persists even if the number of columns is increased from 2 to 4.
  2. Why does value_counter underperform grouper by so much? Is this due to the cost of constructing the list, or of constructing a series from that list?

I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy arrays versus a dictionary comprehension:

x, z = grouper(df), count(df)
%timeit x[x.values > 10]                        # 749µs
%timeit {k: v for k, v in z.items() if v > 10}  # 9.37ms
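
As a quick illustration of those differing outputs, here is a toy example separate from the benchmark below (the small frame and its values are made up purely for display):

import pandas as pd
from collections import Counter

toy = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 3, 4]})

# groupby.size: Series with a two-level MultiIndex (A, B) -> count
print(toy.groupby(['A', 'B'], sort=False).size())

# value_counts: Series indexed by (A, B) tuples -> count
print(pd.Series(list(zip(toy.A, toy.B))).value_counts(sort=False))

# Counter: dict-like mapping of (A, B) tuples -> count
print(Counter(zip(toy.A.values, toy.B.values)))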

However, the focus of my question is on the performance of building comparable results, whether in a series or in a dictionary. My C knowledge is limited, yet I would appreciate any answer that can point to the logic underlying these methods.

Benchmarking code

import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(0)

m, n = 1000, 100000

df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)

def count(df):
    return Counter(zip(df.A.values, df.B.values))

x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)

assert (x == y) & (y == z), "Dictionary mismatch!"

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)

Benchmarking results

Run on Python 3.6.2, pandas 0.20.3, numpy 1.13.1

Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.

Key: g = grouper, v = value_counter, c = count.

m           n        g        v       c
100     10000     2.91    18.30    8.41
1000    10000     4.10    27.20    6.98[1]
100    100000    17.90   130.00   84.50
1000   100000    43.90   309.00   93.50

1 This is not a typo.

Solution

There's actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to numpy arrays being stored in memory in a fundamentally different way than Python objects.

A numpy array, such as np.arange(10), is essentially stored as a contiguous block of memory, not as individual Python objects. Conversely, a Python list, such as list(range(10)), is stored in memory as pointers to individual Python objects (i.e. the integers 0-9). This difference is why numpy arrays take up less memory than their equivalent Python lists, and why computations on numpy arrays can be so much faster.
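
As a rough, machine-dependent illustration of that layout difference (this snippet is my own addition, not from the original post; sys.getsizeof only approximates a list's footprint, so the element sizes are added in by hand):

import sys
import numpy as np

arr = np.arange(10)      # one contiguous buffer of machine integers
lst = list(range(10))    # an array of pointers to ten separate Python int objects

# nbytes reports only the raw data buffer (dtype-dependent, e.g. 80 bytes for int64)
print(arr.nbytes)

# getsizeof(lst) excludes the int objects themselves, so add them for a fuller total
print(sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst))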

So, as Counter is consuming the zip, the associated tuples need to be created as Python objects. This means that Python needs to extract the tuple values from the numpy data and create corresponding Python objects in memory. There is noticeable overhead to this, which is why you want to be very careful when combining pure Python functions with numpy data. A basic example of this pitfall that you might commonly see is using the built-in Python sum on a numpy array: sum(np.arange(10**5)) is actually a bit slower than the pure Python sum(range(10**5)), and both are of course significantly slower than np.sum(np.arange(10**5)).
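
If you want to reproduce that sum comparison yourself, here is a minimal sketch using the standard timeit module (actual numbers will vary by machine, so none are quoted here):

import timeit

setup = "import numpy as np"

# built-in sum over a numpy array: each element is boxed into a Python-level object
print(timeit.timeit("sum(np.arange(10**5))", setup=setup, number=100))

# pure-Python sum over range: no numpy boxing involved, typically a bit faster
print(timeit.timeit("sum(range(10**5))", number=100))

# np.sum stays in C the whole time and should be far faster than either
print(timeit.timeit("np.sum(np.arange(10**5))", setup=setup, number=100))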

See this video for a more in-depth discussion of this topic.

As an example specific to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.

In [2]: a = np.random.randint(10**4, size=10**6)
   ...: b = np.random.randint(10**4, size=10**6)
   ...: a_list = a.tolist()
   ...: b_list = b.tolist()

In [3]: %timeit Counter(zip(a, b))
455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit Counter(zip(a_list, b_list))
334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference between these two timings gives you a reasonable estimate of the overhead discussed earlier: roughly 455 − 334 ≈ 120 ms in this case.

This isn't quite the end of the story though. Constructing a groupby object in pandas involves some overhead too, at least as related to this problem, since there's some groupby metadata that isn't strictly necessary just to get size, whereas Counter does the one singular thing you care about. Usually this overhead is far less than the overhead associated with Counter, but from some quick experimentation I've found that you can actually get marginally better performance from Counter when the majority of your groups consist of just a single element.

Consider the following timings (using @BallpointBen's sort=False suggestion) that go along the spectrum of few large groups <--> many small groups:

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def count(df):
    return Counter(zip(df.A.values, df.B.values))

for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit count(df)

Which gives me the following table:

m       grouper   counter
10      62.9 ms    315 ms
10**3    191 ms    535 ms
10**7    514 ms    459 ms

Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.
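
For completeness, a rough sketch of that conversion (my own addition, not benchmarked; it assumes you want the tuple keys turned into a MultiIndex named after the original columns):

import numpy as np
import pandas as pd
from collections import Counter

df = pd.DataFrame({'A': np.random.randint(0, 1000, 100000),
                   'B': np.random.randint(0, 1000, 100000)})

c = Counter(zip(df.A.values, df.B.values))

# build a Series from the Counter, then promote the tuple keys to a MultiIndex
s = pd.Series(c)
s.index = pd.MultiIndex.from_tuples(s.index, names=['A', 'B'])

# s should now match df.groupby(['A', 'B'], sort=False).size() up to ordering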

