Get total values_count from a dataframe with Python Pandas

Question

I have a Python pandas dataframe with several columns. I want to pool the values from all columns into a single column so that one value_counts result covers every value. In the end I need the total count of string1, string2, and so on up to n. What is the best way to do it?

index row 1    row 2   ...
0     string1  string3
1     string1  string1
2     string2  string2
...

Recommended answer

If performance is an issue, try:

from collections import Counter

Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})
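
For reference, here is a self-contained sketch of the Counter approach. It assumes the three-row sample frame from the question and uses `df.to_numpy()` as a stand-in for `df.values`; the variable names `counts` and `total` are only illustrative.

```python
from collections import Counter

import pandas as pd

# Sample frame matching the question (column names are assumed).
df = pd.DataFrame({"row1": ["string1", "string1", "string2"],
                   "row2": ["string3", "string1", "string2"]})

# Flatten every cell into one 1-D array and count the occurrences.
counts = Counter(df.to_numpy().ravel())

# Optionally wrap the result in a Series so it can be sorted or indexed.
total = pd.Series(counts).sort_values(ascending=False)
print(total)
```

The Counter itself already answers the question; the Series wrapper is only a convenience if the rest of the code works with pandas objects.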

Or stack it into one Series, then use value_counts:

df.stack().value_counts()
#string1    3
#string2    2
#string3    1
#dtype: int64
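
Once the counts are in a Series, individual totals can be looked up by label. The sketch below assumes the sample frame from the question; `string4` is a hypothetical label used only to show the fallback for a value that never occurs.

```python
import pandas as pd

# Sample frame matching the question (column names are assumed).
df = pd.DataFrame({"row1": ["string1", "string1", "string2"],
                   "row2": ["string3", "string1", "string2"]})

# Stack all columns into one long Series, then count each distinct value.
counts = df.stack().value_counts()

# Look up single totals; .get returns the default for an absent label.
n_string1 = counts.get("string1", 0)
n_missing = counts.get("string4", 0)  # hypothetical label, not in the data
print(n_string1, n_missing)
```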

For larger (long) DataFrames with a small number of columns, looping over the columns may be faster than stacking:

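import pandas as pd

# Start from an empty float Series; an explicit dtype avoids the
# ambiguous default dtype of a bare pd.Series().
s = pd.Series(dtype='float64')
for col in df.columns:
    # Add this column's counts; labels missing on either side count as 0.
    s = s.add(df[col].value_counts(), fill_value=0)

#string1    3.0
#string2    2.0
#string3    1.0
#dtype: float64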
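
The loop returns floats because `add` with `fill_value` works in floating point. A minimal sketch of the same loop with the result cast back to integers, assuming the sample frame from the question (the name `total` is illustrative):

```python
import pandas as pd

# Sample frame matching the question (column names are assumed).
df = pd.DataFrame({"row1": ["string1", "string1", "string2"],
                   "row2": ["string3", "string1", "string2"]})

total = pd.Series(dtype="float64")
for col in df.columns:
    # Accumulate per-column counts, treating absent labels as 0.
    total = total.add(df[col].value_counts(), fill_value=0)

# Counts are whole numbers, so the cast back to int64 is lossless here.
total = total.astype("int64").sort_values(ascending=False)
print(total)
```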

Also, there's a numpy solution:

import numpy as np
np.unique(df.values, return_counts=True)

#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))
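
`np.unique` returns a tuple of two arrays rather than a labelled result. The sketch below, which assumes the sample frame from the question, pairs the two arrays up in a Series. Because `np.unique` sorts its input, a frame that mixes strings with missing values may fail to sort, so treat this as suitable for clean, single-type data only.

```python
import numpy as np
import pandas as pd

# Sample frame matching the question (column names are assumed).
df = pd.DataFrame({"row1": ["string1", "string1", "string2"],
                   "row2": ["string3", "string1", "string2"]})

# Unique labels (sorted) and how often each one occurs across all cells.
labels, counts = np.unique(df.to_numpy().ravel(), return_counts=True)

# Pair labels with counts for easier lookup and sorting.
total = pd.Series(counts, index=labels).sort_values(ascending=False)
print(total)
```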


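The timings below use the following setup:

df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})

def vc_from_loop(df):
    # Sum the per-column value counts; explicit dtype for the empty Series.
    s = pd.Series(dtype='float64')
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s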

Small DataFrame (3 rows)

%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.unique(df.values, return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Large DataFrame (900,000 rows)

df = pd.concat([df]*300000, ignore_index=True)

%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.unique(df.values, return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
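
`%timeit` is an IPython magic, so the commands above only run in IPython or Jupyter. A rough equivalent for a plain Python script can be sketched with the standard `timeit` module. The frame size, repeat counts, and the `candidates` dictionary below are arbitrary choices for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({"row1": ["string1", "string1", "string2"],
                   "row2": ["string3", "string1", "string2"]})
df = pd.concat([df] * 1000, ignore_index=True)  # 3,000 rows, kept small

# The approaches to compare, each wrapped as a zero-argument callable.
candidates = {
    "Counter": lambda: Counter(df.to_numpy().ravel()),
    "stack": lambda: df.stack().value_counts(),
    "np.unique": lambda: np.unique(df.to_numpy().ravel(), return_counts=True),
}

results = {}
for name, func in candidates.items():
    # Best of 3 repeats of 10 calls each, converted to seconds per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    results[name] = best
    print(f"{name:10s} {best * 1e6:10.1f} µs per call")
```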
