Python用时间条件删除行 [英] Python removing rows with time condition

查看:135
本文介绍了Python用时间条件删除行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两组数据框,都具有唯一的标识符和日期时间数据,其格式如下

I have 2 sets of Dataframe, both with an unique identifier and a datetime data in the format as such

"2020-01-01 00:00:01" -datetime和"12345"-唯一标识符和类型

"2020-01-01 00:00:01"-datetime and "12345" - unique identifier and Type

第一个问题,DF1:

   DatetimeX            ID    Type
   2020-01-01 02:00:01 12345 C
   2020-01-01 02:00:03 12345 C
   2020-01-01 05:00:03 12345 C
   2020-01-01 05:03:05 12345 C
   2020-01-01 03:00:09 13333 D
   2020-01-01 02:00:09 12345 C
   2020-01-01 02:01:35 12345 C
   2020-01-01 02:10:35 12345 C
   2020-01-01 02:00:01 13333 D
   2020-01-01 02:05:35 13333 D
   2020-01-01 02:00:50 13333 E
   2020-01-01 02:00:01 12211 C
   2020-01-01 02:09:50 13333 E
   2020-01-01 02:11:50 13333 E

我想基于具有相同类型"的ID的第一个时间戳,并在此之后10分钟删除行:

I would like to based on the ID's 1st time stamp with the same "Type", and remove the rows 10mins after as such:

   DatetimeX            ID    Type
   2020-01-01 02:00:01 12345 C
   2020-01-01 05:00:03 12345 C
   2020-01-01 02:10:35 12345 C
   2020-01-01 03:00:09 13333 D
   2020-01-01 02:00:01 13333 D
   2020-01-01 02:00:50 13333 E
   2020-01-01 02:00:01 12211 C
   2020-01-01 02:11:50 13333 E

我尝试探索时间范围/日期范围,但是找不到任何类似的编码概念.希望如果有人能指出我可以探索的方式而不是尝试寻求完整的解决方案.几年没有接触过python并且以前不熟悉它.谢谢

I've tried to explore timerange/daterange but could not find any similar concept of coding. Would hope that if anyone can point out what kind of ways i can look into to explore and not trying to get a full solution. Have not touch python for a few years and not familiar with it previously. Thank you

已更新了其他数据行,以获取更准确的示例

推荐答案

添加示例输入数据并简化了过程:

Add sample input data and simplfied the process:

Timestamp = pd.to_datetime
data = [{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:03'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 05:00:03'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 05:03:05'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 03:00:09'), 'ID': 13333, 'Type': 'D'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:09'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:01:35'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:10:35'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 13333, 'Type': 'D'},
 {'DatetimeX': Timestamp('2020-01-01 02:05:35'), 'ID': 13333, 'Type': 'D'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:50'), 'ID': 13333, 'Type': 'E'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12211, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:09:50'), 'ID': 13333, 'Type': 'E'},
 {'DatetimeX': Timestamp('2020-01-01 02:11:50'), 'ID': 13333, 'Type': 'E'}]
df1 = pd.DataFrame(data)


col_raw = df1.columns
while True:
    df1.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
    df1['diff1_lt10min'] = df1.groupby(['ID', 'Type'])['DatetimeX'].diff().dt.seconds < 10 * 60
    df1['tag_group'] = (~df1['diff1_lt10min']).cumsum()
    if df1.duplicated('tag_group').sum()==0:
        break
    df1 = df1.merge((df1.groupby('tag_group')['DatetimeX'].first()
               .reset_index()
               .rename(columns={'DatetimeX':'DatetimeX_1st'})),
              on='tag_group')
    df1['diff2_lt10min'] = (df1.DatetimeX - df1.DatetimeX_1st).dt.seconds < 10 * 60
    cond = df1['diff1_lt10min'] & df1['diff2_lt10min']
    df1 = df1.loc[~cond, col_raw]
df1 = df1[col_raw]


详细信息...


Detail...

# repeat
col_raw = df1.columns
df4 = df1.copy()
n_round = 1
while True:
    print('#'*20, f'round {n_round}', '#'*20)
    # step 1 sort the values & group by ['Type', 'ID'] calculate the DatetimeX's time diff
    # notice: the time-diff is not the actual wanted
    df = df4[col_raw].copy()
    df.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
    df['diff'] = df.groupby(['Type', 'ID'])['DatetimeX'].diff()
    print('#'*10, 'step1', '#'*10)
    print(df)

    # step 2, create a tag column to store the first 10min gap from 'diff' column
    cond = False 
    cond |= df['diff'].dt.seconds > 10 * 60
    cond |= df['diff'].isnull()
    df['tag'] = np.where(cond, 1, 0)
    df['tag'] = df['tag'].cumsum().fillna(method = 'ffill')
    print('#'*10, 'step2', '#'*10)
    print(df)

    # step 3, use 'tag' to judge to stop the while loop or not
    # tag should be unique
    break_sign = df.tag.duplicated().sum()
    if break_sign == 0:
        break
    print('#'*10, 'step3', '#'*10)
    print(break_sign)
    
    # step 4:
        # create a 'DatetimeX_1st' with the 'tag' group's first DatetimeX
        # create a 'diff2' = 'DatetimeX' - 'DatetimeX_1st'
    df2 = df.reset_index().set_index('tag')
    df2['DatetimeX_1st'] = df.groupby('tag').first()['DatetimeX']
    df2['diff2'] = df2['DatetimeX'] - df2['DatetimeX_1st']
    print('#'*10, 'step4', '#'*10)
    print(df2)
    
    # step 5:
        # drop the True < 10min gaps records
        # 'diff' and 'diff2' should all < 10min
    cond = (df2['diff2'].dt.seconds < 10 * 60) & (df2['diff'].dt.seconds < 10 * 60)
    df3 = df2[~cond].copy()
    print('#'*10, 'step5', '#'*10)
    print(df3)
    
    
    # step 6:
        # reset index
    cols = 'tag DatetimeX   ID  Type'.split()
    df4 = df3.reset_index().set_index('index').sort_index()[cols]
    print('#'*10, 'step6', '#'*10)
    print(df4)
    
    n_round += 1
    print()
    
# get result
result = df[['DatetimeX', 'ID', 'Type']].copy()
result.index.name = None
print()
print('#'*10, 'result', '#'*10)
print(result)


输出:


output:

#################### round 1 ####################
########## step1 ##########
             DatetimeX     ID Type            diff
11 2020-01-01 02:00:01  12211    C             NaT
0  2020-01-01 02:00:01  12345    C             NaT
1  2020-01-01 02:00:03  12345    C 0 days 00:00:02
5  2020-01-01 02:00:09  12345    C 0 days 00:00:06
6  2020-01-01 02:01:35  12345    C 0 days 00:01:26
7  2020-01-01 02:10:35  12345    C 0 days 00:09:00
2  2020-01-01 05:00:03  12345    C 0 days 02:49:28
3  2020-01-01 05:03:05  12345    C 0 days 00:03:02
8  2020-01-01 02:00:01  13333    D             NaT
9  2020-01-01 02:05:35  13333    D 0 days 00:05:34
4  2020-01-01 03:00:09  13333    D 0 days 00:54:34
10 2020-01-01 02:00:50  13333    E             NaT
12 2020-01-01 02:09:50  13333    E 0 days 00:09:00
13 2020-01-01 02:11:50  13333    E 0 days 00:02:00
########## step2 ##########
             DatetimeX     ID Type            diff  tag
11 2020-01-01 02:00:01  12211    C             NaT    1
0  2020-01-01 02:00:01  12345    C             NaT    2
1  2020-01-01 02:00:03  12345    C 0 days 00:00:02    2
5  2020-01-01 02:00:09  12345    C 0 days 00:00:06    2
6  2020-01-01 02:01:35  12345    C 0 days 00:01:26    2
7  2020-01-01 02:10:35  12345    C 0 days 00:09:00    2
2  2020-01-01 05:00:03  12345    C 0 days 02:49:28    3
3  2020-01-01 05:03:05  12345    C 0 days 00:03:02    3
8  2020-01-01 02:00:01  13333    D             NaT    4
9  2020-01-01 02:05:35  13333    D 0 days 00:05:34    4
4  2020-01-01 03:00:09  13333    D 0 days 00:54:34    5
10 2020-01-01 02:00:50  13333    E             NaT    6
12 2020-01-01 02:09:50  13333    E 0 days 00:09:00    6
13 2020-01-01 02:11:50  13333    E 0 days 00:02:00    6
########## step3 ##########
8
########## step4 ##########
     index           DatetimeX     ID Type            diff  \
tag                                                          
1       11 2020-01-01 02:00:01  12211    C             NaT   
2        0 2020-01-01 02:00:01  12345    C             NaT   
2        1 2020-01-01 02:00:03  12345    C 0 days 00:00:02   
2        5 2020-01-01 02:00:09  12345    C 0 days 00:00:06   
2        6 2020-01-01 02:01:35  12345    C 0 days 00:01:26   
2        7 2020-01-01 02:10:35  12345    C 0 days 00:09:00   
3        2 2020-01-01 05:00:03  12345    C 0 days 02:49:28   
3        3 2020-01-01 05:03:05  12345    C 0 days 00:03:02   
4        8 2020-01-01 02:00:01  13333    D             NaT   
4        9 2020-01-01 02:05:35  13333    D 0 days 00:05:34   
5        4 2020-01-01 03:00:09  13333    D 0 days 00:54:34   
6       10 2020-01-01 02:00:50  13333    E             NaT   
6       12 2020-01-01 02:09:50  13333    E 0 days 00:09:00   
6       13 2020-01-01 02:11:50  13333    E 0 days 00:02:00   

          DatetimeX_1st           diff2  
tag                                      
1   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:00:02  
2   2020-01-01 02:00:01 0 days 00:00:08  
2   2020-01-01 02:00:01 0 days 00:01:34  
2   2020-01-01 02:00:01 0 days 00:10:34  
3   2020-01-01 05:00:03 0 days 00:00:00  
3   2020-01-01 05:00:03 0 days 00:03:02  
4   2020-01-01 02:00:01 0 days 00:00:00  
4   2020-01-01 02:00:01 0 days 00:05:34  
5   2020-01-01 03:00:09 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:09:00  
6   2020-01-01 02:00:50 0 days 00:11:00  
########## step5 ##########
     index           DatetimeX     ID Type            diff  \
tag                                                          
1       11 2020-01-01 02:00:01  12211    C             NaT   
2        0 2020-01-01 02:00:01  12345    C             NaT   
2        7 2020-01-01 02:10:35  12345    C 0 days 00:09:00   
3        2 2020-01-01 05:00:03  12345    C 0 days 02:49:28   
4        8 2020-01-01 02:00:01  13333    D             NaT   
5        4 2020-01-01 03:00:09  13333    D 0 days 00:54:34   
6       10 2020-01-01 02:00:50  13333    E             NaT   
6       13 2020-01-01 02:11:50  13333    E 0 days 00:02:00   

          DatetimeX_1st           diff2  
tag                                      
1   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:10:34  
3   2020-01-01 05:00:03 0 days 00:00:00  
4   2020-01-01 02:00:01 0 days 00:00:00  
5   2020-01-01 03:00:09 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:11:00  
########## step6 ##########
       tag           DatetimeX     ID Type
index                                     
0        2 2020-01-01 02:00:01  12345    C
2        3 2020-01-01 05:00:03  12345    C
4        5 2020-01-01 03:00:09  13333    D
7        2 2020-01-01 02:10:35  12345    C
8        4 2020-01-01 02:00:01  13333    D
10       6 2020-01-01 02:00:50  13333    E
11       1 2020-01-01 02:00:01  12211    C
13       6 2020-01-01 02:11:50  13333    E

#################### round 2 ####################
########## step1 ##########
                DatetimeX     ID Type            diff
index                                                
11    2020-01-01 02:00:01  12211    C             NaT
0     2020-01-01 02:00:01  12345    C             NaT
7     2020-01-01 02:10:35  12345    C 0 days 00:10:34
2     2020-01-01 05:00:03  12345    C 0 days 02:49:28
8     2020-01-01 02:00:01  13333    D             NaT
4     2020-01-01 03:00:09  13333    D 0 days 01:00:08
10    2020-01-01 02:00:50  13333    E             NaT
13    2020-01-01 02:11:50  13333    E 0 days 00:11:00
########## step2 ##########
                DatetimeX     ID Type            diff  tag
index                                                     
11    2020-01-01 02:00:01  12211    C             NaT    1
0     2020-01-01 02:00:01  12345    C             NaT    2
7     2020-01-01 02:10:35  12345    C 0 days 00:10:34    3
2     2020-01-01 05:00:03  12345    C 0 days 02:49:28    4
8     2020-01-01 02:00:01  13333    D             NaT    5
4     2020-01-01 03:00:09  13333    D 0 days 01:00:08    6
10    2020-01-01 02:00:50  13333    E             NaT    7
13    2020-01-01 02:11:50  13333    E 0 days 00:11:00    8

########## result ##########
             DatetimeX     ID Type
11 2020-01-01 02:00:01  12211    C
0  2020-01-01 02:00:01  12345    C
7  2020-01-01 02:10:35  12345    C
2  2020-01-01 05:00:03  12345    C
8  2020-01-01 02:00:01  13333    D
4  2020-01-01 03:00:09  13333    D
10 2020-01-01 02:00:50  13333    E
13 2020-01-01 02:11:50  13333    E

这篇关于Python用时间条件删除行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆