Check if all values in dataframe column are the same

Question

I want a quick and easy way to check whether all values in the counts column of a dataframe are the same:

In:

import pandas as pd

d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'], 'counts': [3, 4, 3, 3]}
df = pd.DataFrame(data=d)
df

Out:

  names  counts
0   Jim       3
1   Ted       4
2   Mal       3
3   Ted       3

I just want a simple condition: if all counts are the same value, then print('True').

Is there a fast way to do this?

Recommended answer

An efficient way to do this is to compare the first value with the rest and use all:

def is_unique(s):
    a = s.to_numpy() # s.values (pandas<0.24)
    return (a[0] == a[1:]).all()

is_unique(df['counts'])
# False
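The function above can be wired into the condition the question asks for. The following is a minimal sketch, not part of the original answer: the name all_same is hypothetical, and treating an empty column as "all the same" is an assumption made here so that a[0] does not raise an IndexError. Note also that NaN never compares equal to itself, so a column of only NaN values is reported as not constant.

```python
import numpy as np
import pandas as pd

def all_same(s: pd.Series) -> bool:
    """Hypothetical helper: True if every value in s equals the first one."""
    a = s.to_numpy()
    if a.size == 0:
        # Assumption: an empty column counts as "all the same"
        return True
    return bool((a[0] == a[1:]).all())

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3]})

# The condition from the question: print 'True' only for a constant column
if all_same(df['counts']):
    print('True')

print(all_same(df['counts']))                  # False
print(all_same(pd.Series([3, 3, 3])))          # True
print(all_same(pd.Series([], dtype='int64')))  # True (by the assumption above)
print(all_same(pd.Series([np.nan, np.nan])))   # False, because NaN != NaN
```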


For an entire dataframe

To perform the same check on every column of a dataframe, we can extend the above by passing axis=0 to all:

def unique_cols(df):
    a = df.to_numpy() # df.values (pandas<0.24)
    return (a[0] == a[1:]).all(0)

For the shared example, we'd get:

unique_cols(df)
# array([False, False])
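The bare boolean array is easier to read once it is labelled with the column names. Here is a small sketch of one way to do that; the extra flag column and the variable names result and constant_columns are made up for illustration. Keep in mind that to_numpy on a dataframe with mixed dtypes produces an object array, so the comparison is slower than on purely numeric data.

```python
import pandas as pd

def unique_cols(df):
    a = df.to_numpy()
    return (a[0] == a[1:]).all(0)

df = pd.DataFrame({'names': ['Jim', 'Ted', 'Mal', 'Ted'],
                   'counts': [3, 4, 3, 3],
                   'flag': [1, 1, 1, 1]})  # hypothetical constant column

# Attach the column labels to the per-column result
result = pd.Series(unique_cols(df), index=df.columns)

# Names of the columns whose values are all identical
constant_columns = result[result].index.tolist()
print(constant_columns)  # ['flag']
```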


Here's a benchmark of the above methods compared with some other approaches, such as using nunique (for a pd.Series):

import numpy as np
import perfplot

s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))

perfplot.show(
    setup=lambda n: s_num.iloc[:int(n)], 

    kernels=[
        lambda s: s.nunique() == 1,
        lambda s: is_unique(s)
    ],

    labels=['nunique', 'first_vs_rest'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
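If perfplot is not installed, a rough comparison can be made with the standard-library timeit module. This is only a sketch at a single, arbitrarily chosen size (100,000 values, 20 repetitions); it prints whatever timings your machine produces and makes no claim about the result.

```python
import timeit

import numpy as np
import pandas as pd

def is_unique(s):
    a = s.to_numpy()
    return (a[0] == a[1:]).all()

rng = np.random.default_rng(0)
s_num = pd.Series(rng.integers(0, 1_000, 100_000))

kernels = {
    'nunique': lambda s: s.nunique() == 1,
    'first_vs_rest': is_unique,
}

# Both kernels should agree on the answer before their speed is compared
results = {label: bool(fn(s_num)) for label, fn in kernels.items()}
assert len(set(results.values())) == 1

repeats = 20  # arbitrary choice for this sketch
for label, fn in kernels.items():
    seconds = timeit.timeit(lambda: fn(s_num), number=repeats)
    print(f'{label}: {seconds / repeats * 1e3:.3f} ms per call')
```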

Below are the timings for a pd.DataFrame. Let's also compare with a numba approach, which is especially useful here because it can short-circuit as soon as it sees a value that differs from the first one in a given column (note: the numba approach only works with numerical data):

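import numpy as np
from numba import njit

@njit
def unique_cols_nb(a):
    n_cols = a.shape[1]
    # 1 where every value in the column equals the first one, else 0
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                # Mismatch found: stop scanning this column early
                break
        else:
            # Loop finished without a break: the column is constant
            out[i] = 1
    return out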
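The early exit relies on Python's for/else: the else branch runs only when the inner loop finishes without hitting break. To see the control flow without installing numba, here is the same logic as a plain-Python sketch. The name unique_cols_py is made up for this example, and without JIT compilation the loop is slow on large arrays, so it is meant only as an illustration.

```python
import numpy as np

def unique_cols_py(a):
    """Plain-Python illustration of the short-circuit loop (no numba)."""
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=bool)
    for i in range(n_cols):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                break          # stop at the first value that differs
        else:
            out[i] = True      # no break happened: column is constant
    return out

a = np.array([[1, 5, 0],
              [1, 6, 0],
              [1, 5, 0]])
print(unique_cols_py(a))  # [ True False  True]
```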

If we compare the three methods:

df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)), 
                                  np.zeros((500_000, 10))], axis=1))

perfplot.show(
    setup=lambda n: df.iloc[:int(n),:], 

    kernels=[
        lambda df: (df.nunique(0) == 1).values,
        lambda df: unique_cols_nb(df.values).astype(bool),
        lambda df: unique_cols(df) 
    ],

    labels=['nunique', 'unique_cols_nb', 'unique_cols'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
