Pandas DataFrame check if column value exists in a group of columns


Problem description

I have a DataFrame like this (simplified example):

id  v0  v1  v2  v3   v4
1   10  5   10  22   50
2   22  23  55  60   50
3   8   2   40  80   110
4   15  15  25  100  101

I would like to add a column that is 1 if the row's v0 value appears anywhere in v1 through v4 of that same row, and 0 otherwise. In this example, id 1 should get 1 (since v2 = 10) and id 2 should get 0, since 22 does not appear in v1 through v4.

In reality the table is much larger: around 100,000 rows, with variables running from v1 to v99.
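For anyone who wants to reproduce the example, the sample table can be built as below. This is only a sketch of the four rows shown above; the `expected` list is worked out by hand from those rows and is not part of the original question.

```python
import pandas as pd

# The four sample rows from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

# Hand-derived target: 1 where v0 appears among v1..v4 of the same row
expected = [1, 0, 0, 1]
```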

Recommended answer

You can work on the underlying NumPy arrays for performance:

Setup

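a = df.v0.values            # v0 as a 1-D array, shape (n,)
b = df.iloc[:, 2:].values   # v1 onwards as a 2-D array, shape (n, k)

# Compare each v0 with its own row of b, then reduce across the columns
df.assign(out=(a[:, None] == b).any(axis=1).astype(int))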

   id  v0  v1  v2   v3   v4  out
0   1  10   5  10   22   50    1
1   2  22  23  55   60   50    0
2   3   8   2  40   80  110    0
3   4  15  15  25  100  101    1


This solution uses broadcasting to compare each v0 value with every other value in its row:

First, we add a trailing axis to a, turning it into a column:

>>> a[:, None]
array([[10],
       [22],
       [ 8],
       [15]], dtype=int64)

That column then broadcasts against b, giving an elementwise comparison:

>>> a[:, None] == b
array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]])

We then simply check for any True result along axis 1 (that is, across each row) and convert the result to integers.
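Putting the steps together, a self-contained sketch on the sample data might look like this. It selects the comparison columns by label (`"v1":"v4"`) and uses `to_numpy()` rather than `iloc[:, 2:]` and `.values`; both are substitutions made here for readability, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

a = df["v0"].to_numpy()               # shape (4,)
b = df.loc[:, "v1":"v4"].to_numpy()   # shape (4, 4)

# a[:, None] has shape (4, 1); comparing it with b broadcasts to (4, 4)
mask = a[:, None] == b

# Collapse each row to a single flag: did v0 match any of v1..v4?
out = mask.any(axis=1).astype(int)

result = df.assign(out=out)
print(result)
```

The intermediate `mask` holds one boolean per compared cell, so for the table size mentioned in the question (100,000 rows by 99 columns) it would take roughly 9.9 MB.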

Performance

Functions

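def user_chris(df):
    # NumPy broadcasting: compare v0 (as a column) with v1 onwards
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None] == b).any(axis=1).astype(int)

def rahlf23(df):
    # isin with a Series matches on the index, so each row is checked against its own v0
    df = df.set_index('id')
    return df.drop(columns='v0').isin(df['v0']).any(axis=1).astype(int)

def chris_a(df):
    # Row-aligned equality test against v0
    return df.loc[:, "v1":].eq(df['v0'], axis=0).any(axis=1).astype(int)

def chris(df):
    # Row-by-row Python membership test
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    # set_index without inplace=True, so repeated timing runs do not alter the caller's df
    df = df.set_index('id')
    return df.isin(df.pop('v0')).any(axis=1).astype(int)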
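Before timing anything, it may be worth confirming that the approaches agree. The sketch below re-implements four of them under hypothetical names (`via_broadcast`, `via_eq`, `via_isin`, `via_apply`) and runs them on the sample data; it passes `axis` by keyword, since recent pandas releases no longer accept it positionally in `any` and `drop`.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

def via_broadcast(df):
    # NumPy broadcasting, as in the recommended answer
    a = df["v0"].to_numpy()
    b = df.loc[:, "v1":].to_numpy()
    return (a[:, None] == b).any(axis=1).astype(int)

def via_eq(df):
    # Row-aligned equality against v0
    return df.loc[:, "v1":].eq(df["v0"], axis=0).any(axis=1).astype(int)

def via_isin(df):
    # DataFrame.isin with a Series compares values at matching index labels
    indexed = df.set_index("id")
    return indexed.drop(columns="v0").isin(indexed["v0"]).any(axis=1).astype(int)

def via_apply(df):
    # Plain Python membership test, one row at a time
    return df.apply(lambda row: int(row["v0"] in row.to_numpy()[2:]), axis=1)

# Collect each function's flags as a plain list for easy comparison
results = {
    f.__name__: f(df).tolist()
    for f in (via_broadcast, via_eq, via_isin, via_apply)
}
for name, flags in results.items():
    print(name, flags)
```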

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
       index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
       columns=[10, 50, 100, 500, 1000, 5000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Output

[Figure: log-log plot of time relative to the fastest function against N, one line per function]
