Python Pandas - combining 2 lines from data frame - with condition



I have a Pandas data frame that looks like this:

A       B     C    Stime    Etime    
1220627 a   10.0 18:00:00 18:09:59
1220627 a   12.0 18:15:00 18:26:59
1220683 b   3.0  18:36:00 18:38:59
1220683 a   3.0  18:36:00 18:38:59
1220732 a   59.0 18:00:00 18:58:59
1220760 A   16.0 18:24:00 18:39:59
1220760 a   16.0 18:24:00 18:39:59
1220760 A   19.0 18:40:00 18:58:59
1220760 b   19.0 18:40:00 18:58:59
1220760 a   19.0 18:40:00 18:58:59
1220775 a   3.0  18:03:00 18:05:59

The Stime and Etime columns are of type datetime.

C is the number of minutes between Stime and Etime.

Column A is a household ID and column B is a person ID within the household (so columns A and B together identify a unique person).

What I need to do is update the table such that, for a certain person, if a row's Stime comes right after the previous row's Etime, I merge the two rows and update C.

For example, for person a in HH 1220760 the first Etime is 18:39:59 and the second Stime is 18:40:00, which comes right after 18:39:59, so I would like to merge those rows and update C for this person to 35 (16+19).

I tried to use groupby but I don't know how to add the condition that Stime comes right after Etime.

Solution

If we add one second to Etime, then we can find the rows to be joined by grouping by ['A', 'B'] and, within each group, comparing the shifted Etime with the next row's Stime:

df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
#           A  B     C               Etime               Stime   keep
# 0   1220627  a  10.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True
# 1   1220627  a  12.0 2016-05-29 18:27:00 2016-05-29 18:15:00   True
# 3   1220683  a   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True
# 2   1220683  b   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True
# 4   1220732  a  59.0 2016-05-29 18:59:00 2016-05-29 18:00:00   True
# 5   1220760  A  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True
# 7   1220760  A  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False
# 12  1220760  a   0.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True
# 6   1220760  a  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True
# 9   1220760  a  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False
# 11  1220760  a  11.0 2016-05-29 19:10:00 2016-05-29 18:59:00  False
# 8   1220760  b  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00   True
# 10  1220775  a   3.0 2016-05-29 18:06:00 2016-05-29 18:03:00   True

We will want to keep rows where keep is True and remove rows where keep is False, except that we will also want to update the Etimes as appropriate.

It would be nice if we could assign a "group number" to each row so that we could group by ['A', 'B', 'group_number'] -- and in fact we can. All we need to do is apply cumsum to the keep column:

df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
#           A  B     C               Etime               Stime   keep  group_number
# 0   1220627  a  10.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True           1.0
# 1   1220627  a  12.0 2016-05-29 18:27:00 2016-05-29 18:15:00   True           2.0
# 3   1220683  a   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True           1.0
# 2   1220683  b   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True           1.0
# 4   1220732  a  59.0 2016-05-29 18:59:00 2016-05-29 18:00:00   True           1.0
# 5   1220760  A  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True           1.0
# 7   1220760  A  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False           1.0
# 12  1220760  a   0.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True           1.0
# 6   1220760  a  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True           2.0
# 9   1220760  a  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False           2.0
# 11  1220760  a  11.0 2016-05-29 19:10:00 2016-05-29 18:59:00  False           2.0
# 8   1220760  b  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00   True           1.0
# 10  1220775  a   3.0 2016-05-29 18:06:00 2016-05-29 18:03:00   True           1.0
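The cumsum trick works because booleans count as 1 and 0, so each True (the start of a new run of mergeable rows) bumps the group number. A tiny standalone sketch of just that step:

```python
import pandas as pd

# Each True marks the start of a new group; cumsum turns the flags
# into consecutive group numbers within the series
keep = pd.Series([True, False, True, False, False, True])
print(keep.cumsum().tolist())  # [1, 1, 2, 2, 2, 3]
```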

Now the desired result can be found by grouping by ['A', 'B', 'group_number'], and finding the minimum Stime and maximum Etime for each group:

result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})

                                     Stime               Etime
A       B group_number                                        
1220627 a 1.0          2016-05-29 18:00:00 2016-05-29 18:10:00
          2.0          2016-05-29 18:15:00 2016-05-29 18:27:00
1220683 a 1.0          2016-05-29 18:36:00 2016-05-29 18:39:00
        b 1.0          2016-05-29 18:36:00 2016-05-29 18:39:00
1220732 a 1.0          2016-05-29 18:00:00 2016-05-29 18:59:00
1220760 A 1.0          2016-05-29 18:24:00 2016-05-29 18:59:00
        a 1.0          2016-05-29 18:00:00 2016-05-29 18:10:00
          2.0          2016-05-29 18:24:00 2016-05-29 19:10:00
        b 1.0          2016-05-29 18:40:00 2016-05-29 18:59:00
1220775 a 1.0          2016-05-29 18:03:00 2016-05-29 18:06:00


Putting it all together,

import pandas as pd

df = pd.DataFrame(
    {'A': [1220627, 1220627, 1220683, 1220683, 1220732, 1220760, 1220760,
           1220760, 1220760, 1220760, 1220775, 1220760, 1220760],
     'B': ['a', 'a', 'b', 'a', 'a', 'A', 'a', 'A', 'b', 'a', 'a', 'a', 'a'], 
     'C': [10.0, 12.0, 3.0, 3.0, 59.0, 16.0, 16.0, 19.0, 19.0, 19.0, 3.0, 11.0, 0], 
     'Stime': ['18:00:00', '18:15:00', '18:36:00', '18:36:00', '18:00:00',
               '18:24:00', '18:24:00', '18:40:00', '18:40:00', '18:40:00', 
               '18:03:00', '18:59:00', '18:00:00'],
     'Etime': ['18:09:59', '18:26:59', '18:38:59', '18:38:59', '18:58:59',
               '18:39:59', '18:39:59', '18:58:59', '18:58:59', '18:58:59', 
               '18:05:59', '19:09:59', '18:09:59'],})
for col in ['Stime', 'Etime']:
    df[col] = pd.to_datetime(df[col])
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
result = result.reset_index()
result['C'] = (result['Etime']-result['Stime']).dt.total_seconds() / 60.0
result = result[['A', 'B', 'C', 'Stime', 'Etime']]
print(result)

yields

         A  B     C               Stime               Etime
0  1220627  a  10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
1  1220627  a  12.0 2016-05-29 18:15:00 2016-05-29 18:27:00
2  1220683  a   3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
3  1220683  b   3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
4  1220732  a  59.0 2016-05-29 18:00:00 2016-05-29 18:59:00
5  1220760  A  35.0 2016-05-29 18:24:00 2016-05-29 18:59:00
6  1220760  a  10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
7  1220760  a  46.0 2016-05-29 18:24:00 2016-05-29 19:10:00
8  1220760  b  19.0 2016-05-29 18:40:00 2016-05-29 18:59:00
9  1220775  a   3.0 2016-05-29 18:03:00 2016-05-29 18:06:00
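As a variation, instead of recomputing C from the merged timestamps, the existing C values can be summed inside the same groupby. A minimal sketch on just the two rows for person a in HH 1220760 (the 16+19 case from the question):

```python
import pandas as pd

# Minimal subset: the two adjacent rows for person 'a' in household 1220760
df = pd.DataFrame({
    'A': [1220760, 1220760],
    'B': ['a', 'a'],
    'C': [16.0, 19.0],
    'Stime': pd.to_datetime(['18:24:00', '18:40:00']),
    'Etime': pd.to_datetime(['18:39:59', '18:58:59']),
})
df['Etime'] += pd.Timedelta(seconds=1)          # make intervals half-open
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A', 'B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A', 'B'])['keep'].cumsum()

# Summing C directly yields the same 35.0 as recomputing from the timestamps
result = df.groupby(['A', 'B', 'group_number']).agg(
    {'Stime': 'min', 'Etime': 'max', 'C': 'sum'}).reset_index()
print(result['C'].tolist())  # [35.0]
```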


One of the advantages of using half-open intervals of the form [start, end) instead of fully-closed intervals [start, end] is that when two intervals abut, the end of one equals the start of the next.

Another advantage is that the number of minutes in a half-open interval equals end-start. With a fully-closed interval, the formula becomes end-start+1.

Python's built-in range and list slicing syntax use half-open intervals for these same reasons. So I would recommend using half-open intervals [Stime, Etime) in your DataFrame too.
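To illustrate the arithmetic, here is a minimal check that the half-open end (one second past the closed end) gives the minute count directly as end - start:

```python
import pandas as pd

start = pd.Timestamp('2016-05-29 18:24:00')
closed_end = pd.Timestamp('2016-05-29 18:39:59')      # fully-closed [start, end]
half_open_end = closed_end + pd.Timedelta(seconds=1)  # half-open [start, end)

# With a half-open interval the minute count is simply end - start
minutes = (half_open_end - start).total_seconds() / 60.0
print(minutes)  # 16.0
```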
