Pandas DataFrame根据条件进行分组 [英] Pandas DataFrame groupby based on condition
问题描述
我发现的最相似的问题是此处,但没有正确的答案
The most similar question I found was here but with no proper answer.
基本上,我在尝试在数据帧上使用groupby来生成公交路线的唯一ID时遇到问题.问题是,我可以随时使用的数据(尽管很少)的groupby列具有相同的值,因此即使它们不是同一总线,它们也被视为同一总线.
Basically I have an issue where I'm trying to use groupby on a dataframe to generate unique IDs for bus routes. The problem is, the data I have at my disposal sometimes (though rarely) has the same values for my groupby columns, so they're considered the same bus even though they aren't.
我唯一想到的另一种方法是基于称为停止类型"的另一列对总线进行分组,其中有一个指示开始",中间"和结束"的指示器.我想使用groupby根据此列创建组,其中每个组从停止类型" =开始的地方开始,到停止类型" =结束的地方结束.
The only other way I can think of is to group buses based on another column called "Type of stop", where there is an indicator for Start, Middle and End. I'd like to use groupby to create groups based on this column where each group starts where "type of stop" = Start, and ends where "type of stop" = End.
请考虑以下数据:
df = pd.DataFrame({'Vehicle_ID': ['A']*18,
'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3)})
Cond Position
0 A START
1 A MID
2 A MID
3 A END
4 A MID
5 A START
6 A START
7 A MID
8 A MID
9 A END
10 A MID
11 A START
12 A START
13 A MID
14 A MID
15 A END
16 A MID
17 A START
我想出的将这些总线准确地分组在一起的唯一方法是生成带有总线序列ID的附加列,但是鉴于我要处理大量数据,所以这不是一个非常有效的解决方案.我希望能够通过一个groupby来管理我想做的事情,以便生成以下输出
The only way I came up with to accurately group these buses together is to generate an additional column with the bus sequence id, but given that I'm working with lots of data, this isn't a very efficient solution. I'm hoping to be able to manage what I want to do with a single groupby, if possible, in order to generate the following output
Cond Position Group
0 A START 1
1 A MID 1
2 A MID 1
3 A END 1
4 A MID
5 A START 2
6 A START 2
7 A MID 2
8 A MID 2
9 A END 2
10 A MID
11 A START 3
12 A START 3
13 A MID 3
14 A MID 3
15 A END 3
16 A MID
17 A START 4
推荐答案
一个想法是通过np.select
分解,然后通过numba
使用自定义循环:
One idea is to factorize via np.select
, then use a custom loop via numba
:
from numba import njit
df = pd.DataFrame({'Vehicle_ID': ['A']*18,
'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})
@njit
def grouper(pos):
res = np.empty(pos.shape)
num = 1
started = 0
for i in range(len(res)):
current_pos = pos[i]
if (started == 0) and (current_pos == 0):
started = 1
res[i] = num
elif (started == 1) and (current_pos == 1):
started = 0
res[i] = num
num += 1
elif (started == 1) and (current_pos in [-1, 0]):
res[i] = num
else:
res[i] = 0
return res
arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
df['Group'] = grouper(arr).astype(int)
结果:
print(df)
Position Vehicle_ID Group
0 START A 1
1 MID A 1
2 MID A 1
3 END A 1
4 MID A 0
5 START A 2
6 START A 2
7 MID A 2
8 MID A 2
9 END A 2
10 MID A 0
11 START A 3
12 START A 3
13 MID A 3
14 MID A 3
15 END A 3
16 MID A 0
17 START A 4
我认为,您不应不包含空白"值,因为这将迫使您的系列成为object
dtype,对于任何后续处理均效率低下.如上所述,您可以改用0
.
In my opinion, you should not include "blank" values as this would force your series to be object
dtype, inefficient for any subsequent processing. As above, you can use 0
instead.
numba
比一种纯熊猫方法快约10倍:-
numba
is around ~10x faster than one pure Pandas approach:-
import pandas as pd, numpy as np
from numba import njit
df = pd.DataFrame({'Vehicle_ID': ['A']*18,
'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})
df = pd.concat([df]*10, ignore_index=True)
assert joz(df.copy()).equals(jpp(df.copy()))
%timeit joz(df.copy()) # 18.6 ms per loop
%timeit jpp(df.copy()) # 1.95 ms per loop
基准功能:
def joz(df):
# identification of sequences
df['Position_Prev'] = df['Position'].shift(1)
df['Sequence'] = 0
df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
df.loc[df['Position'] == 'END', 'Sequence'] = -1
df['Sequence_Sum'] = df['Sequence'].cumsum()
df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1
# take only items between START and END and generate Group number
df2 = df[df['Sequence_Sum'] == 1].copy()
df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
df2['Group'] = df2['Sequence'].cumsum()
# merge results to one dataframe
df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
df['Group'] = df['Group'].fillna(0)
df['Group'] = df['Group'].astype(int)
df.drop(['Position_Prev', 'Sequence', 'Sequence_Sum'], axis=1, inplace=True)
return df
@njit
def grouper(pos):
res = np.empty(pos.shape)
num = 1
started = 0
for i in range(len(res)):
current_pos = pos[i]
if (started == 0) and (current_pos == 0):
started = 1
res[i] = num
elif (started == 1) and (current_pos == 1):
started = 0
res[i] = num
num += 1
elif (started == 1) and (current_pos in [-1, 0]):
res[i] = num
else:
res[i] = 0
return res
def jpp(df):
arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
df['Group'] = grouper(arr).astype(int)
return df
这篇关于Pandas DataFrame根据条件进行分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!