Pandas DataFrame groupby based on condition


Problem Description

The most similar question I found was here but with no proper answer.

Basically I have an issue where I'm trying to use groupby on a dataframe to generate unique IDs for bus routes. The problem is, the data I have at my disposal sometimes (though rarely) has the same values for my groupby columns, so they're considered the same bus even though they aren't.

The only other way I can think of is to group buses based on another column called "Type of stop", where there is an indicator for Start, Middle and End. I'd like to use groupby to create groups based on this column where each group starts where "type of stop" = Start, and ends where "type of stop" = End.

Consider the following data:

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})

   Vehicle_ID   Position
0     A   START
1     A   MID  
2     A   MID   
3     A   END    
4     A   MID    
5     A   START   
6     A   START   
7     A   MID    
8     A   MID    
9     A   END    
10    A   MID   
11    A   START    
12    A   START    
13    A   MID    
14    A   MID    
15    A   END     
16    A   MID    
17    A   START

The only way I came up with to accurately group these buses together is to generate an additional column with the bus sequence ID, but given that I'm working with lots of data, this isn't a very efficient solution. I'm hoping to manage what I want to do with a single groupby, if possible, in order to generate the following output:

   Vehicle_ID   Position   Group
0     A   START      1
1     A   MID        1
2     A   MID        1
3     A   END        1
4     A   MID        
5     A   START      2
6     A   START      2
7     A   MID        2
8     A   MID        2
9     A   END        2 
10    A   MID        
11    A   START      3
12    A   START      3 
13    A   MID        3
14    A   MID        3
15    A   END        3 
16    A   MID        
17    A   START      4

Recommended Answer

One idea is to factorize via np.select, then use a custom loop via numba:

import numpy as np
import pandas as pd
from numba import njit

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})

@njit
def grouper(pos):
    # pos comes from the np.select call below: 0 = START, 1 = END, -1 = MID.
    # Single pass: open a group at START, close it at END; rows that fall
    # outside a START..END run are assigned group 0.
    res = np.empty(pos.shape)
    num = 1
    started = 0
    for i in range(len(res)):
        current_pos = pos[i]
        if (started == 0) and (current_pos == 0):
            started = 1
            res[i] = num
        elif (started == 1) and (current_pos == 1):
            started = 0
            res[i] = num
            num += 1
        elif (started == 1) and (current_pos in [-1, 0]):
            res[i] = num
        else:
            res[i] = 0
    return res

arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)

df['Group'] = grouper(arr).astype(int)

Result:

print(df)

   Position Vehicle_ID  Group
0     START          A      1
1       MID          A      1
2       MID          A      1
3       END          A      1
4       MID          A      0
5     START          A      2
6     START          A      2
7       MID          A      2
8       MID          A      2
9       END          A      2
10      MID          A      0
11    START          A      3
12    START          A      3
13      MID          A      3
14      MID          A      3
15      END          A      3
16      MID          A      0
17    START          A      4

In my opinion, you should not include "blank" values, as this would force your series to be object dtype, which is inefficient for any subsequent processing. As above, you can use 0 instead.
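
A minimal illustration of that point (standalone series, not from the original answer): mixing blank strings into an integer column forces object dtype, while using 0 keeps the column numeric.

import pandas as pd

s_blank = pd.Series([1, 1, '', 2, 2])   # mixed int/str -> object dtype
s_zero = pd.Series([1, 1, 0, 2, 2])     # all ints      -> int64 dtype
print(s_blank.dtype, s_zero.dtype)      # object int64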

numba is around ~10x faster than a pure Pandas approach:

import pandas as pd, numpy as np
from numba import njit

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})


df = pd.concat([df]*10, ignore_index=True)

assert joz(df.copy()).equals(jpp(df.copy()))

%timeit joz(df.copy())  # 18.6 ms per loop
%timeit jpp(df.copy())  # 1.95 ms per loop

Benchmark functions:

def joz(df):
    # identification of sequences
    df['Position_Prev'] = df['Position'].shift(1)
    df['Sequence'] = 0
    df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
    df.loc[df['Position'] == 'END', 'Sequence'] = -1
    df['Sequence_Sum'] = df['Sequence'].cumsum()
    df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1

    # take only items between START and END and generate Group number
    df2 = df[df['Sequence_Sum'] == 1].copy()
    df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
    df2['Group'] = df2['Sequence'].cumsum()

    # merge results to one dataframe
    df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
    df['Group'] = df['Group'].fillna(0)
    df['Group'] = df['Group'].astype(int)
    df.drop(['Position_Prev', 'Sequence', 'Sequence_Sum'], axis=1, inplace=True)    
    return df

@njit
def grouper(pos):
    res = np.empty(pos.shape)
    num = 1
    started = 0
    for i in range(len(res)):
        current_pos = pos[i]
        if (started == 0) and (current_pos == 0):
            started = 1
            res[i] = num
        elif (started == 1) and (current_pos == 1):
            started = 0
            res[i] = num
            num += 1
        elif (started == 1) and (current_pos in [-1, 0]):
            res[i] = num
        else:
            res[i] = 0
    return res

def jpp(df):
    arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
    df['Group'] = grouper(arr).astype(int)
    return df
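
As a usage sketch (assuming the grouper function and the np.select factorization shown above), the resulting Group column gives the groupby the question asked for: rows outside any START..END trip carry 0 and can simply be dropped first.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})
arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
df['Group'] = grouper(arr).astype(int)   # grouper as defined above

# Rows with Group == 0 fall outside a START..END run; group the rest per bus trip
trips = df[df['Group'] != 0].groupby('Group')
print(trips.size())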
