Python Pandas - combining 2 lines from data frame - with condition
I have a pandas DataFrame that looks like this:
A        B     C  Stime     Etime
1220627  a  10.0  18:00:00  18:09:59
1220627  a  12.0  18:15:00  18:26:59
1220683  b   3.0  18:36:00  18:38:59
1220683  a   3.0  18:36:00  18:38:59
1220732  a  59.0  18:00:00  18:58:59
1220760  A  16.0  18:24:00  18:39:59
1220760  a  16.0  18:24:00  18:39:59
1220760  A  19.0  18:40:00  18:58:59
1220760  b  19.0  18:40:00  18:58:59
1220760  a  19.0  18:40:00  18:58:59
1220775  a   3.0  18:03:00  18:05:59
The Stime and Etime columns are of type datetime.
C is the number of minutes between Stime and Etime.
Column A is the household ID and column B is the person ID within the household
(so columns A and B together identify a unique person).
What I need to do is update the table so that if, for a given person, an Stime comes right after the previous Etime, the two rows are united and C is updated.
For example, for person a in household 1220760, the first Etime is 18:39:59 and the second Stime is 18:40:00, which comes right after 18:39:59, so I would like to unite the rows and update C for this person to 35 (16+19).
I tried to use groupby, but I don't know how to add the condition that Stime comes right after Etime.
If we add one second to Etime, then we can find the rows to be joined by grouping by ['A', 'B'] and then, within each group, comparing the shifted Etimes with the next Stime:
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
#           A  B     C                Etime                Stime   keep
#  0  1220627  a  10.0  2016-05-29 18:10:00  2016-05-29 18:00:00   True
#  1  1220627  a  12.0  2016-05-29 18:27:00  2016-05-29 18:15:00   True
#  3  1220683  a   3.0  2016-05-29 18:39:00  2016-05-29 18:36:00   True
#  2  1220683  b   3.0  2016-05-29 18:39:00  2016-05-29 18:36:00   True
#  4  1220732  a  59.0  2016-05-29 18:59:00  2016-05-29 18:00:00   True
#  5  1220760  A  16.0  2016-05-29 18:40:00  2016-05-29 18:24:00   True
#  7  1220760  A  19.0  2016-05-29 18:59:00  2016-05-29 18:40:00  False
# 12  1220760  a   0.0  2016-05-29 18:10:00  2016-05-29 18:00:00   True
#  6  1220760  a  16.0  2016-05-29 18:40:00  2016-05-29 18:24:00   True
#  9  1220760  a  19.0  2016-05-29 18:59:00  2016-05-29 18:40:00  False
# 11  1220760  a  11.0  2016-05-29 19:10:00  2016-05-29 18:59:00  False
#  8  1220760  b  19.0  2016-05-29 18:59:00  2016-05-29 18:40:00   True
# 10  1220775  a   3.0  2016-05-29 18:06:00  2016-05-29 18:03:00   True
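To see why this comparison works, it may help to look at the shift in isolation. The following is only a sketch on a made-up frame: the column names person, start, end and the integer "times" are illustrative assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical toy data: two people, integers standing in for timestamps.
toy = pd.DataFrame({
    'person': ['p', 'p', 'p', 'q', 'q'],
    'start':  [0, 10, 25, 0, 5],
    'end':    [10, 20, 30, 5, 9],   # half-open intervals [start, end)
})

# shift(1) moves each end down by one row, but only within the same person,
# so the first row of every person gets NaN as its "previous end".
toy['prev_end'] = toy.groupby('person')['end'].shift(1)

# A row starts a new run when its start differs from the previous end.
# NaN compares unequal to everything, so each person's first row is True.
toy['keep'] = toy['prev_end'] != toy['start']

print(toy)
```

Only the rows whose start equals the previous end of the same person come out as False; those are the rows that get absorbed into the row before them.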
We will want to keep the rows where keep is True and remove the rows where keep is False, except that we will also want to update the Etimes as appropriate.
It would be nice if we could assign a "group number" to each row so that we could group by ['A', 'B', 'group_number'] -- and in fact we can. All we need to do is apply cumsum to the keep column:
df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
#           A  B     C                Etime                Stime   keep  group_number
#  0  1220627  a  10.0  2016-05-29 18:10:00  2016-05-29 18:00:00   True           1.0
#  1  1220627  a  12.0  2016-05-29 18:27:00  2016-05-29 18:15:00   True           2.0
#  3  1220683  a   3.0  2016-05-29 18:39:00  2016-05-29 18:36:00   True           1.0
#  2  1220683  b   3.0  2016-05-29 18:39:00  2016-05-29 18:36:00   True           1.0
#  4  1220732  a  59.0  2016-05-29 18:59:00  2016-05-29 18:00:00   True           1.0
#  5  1220760  A  16.0  2016-05-29 18:40:00  2016-05-29 18:24:00   True           1.0
#  7  1220760  A  19.0  2016-05-29 18:59:00  2016-05-29 18:40:00  False           1.0
# 12  1220760  a   0.0  2016-05-29 18:10:00  2016-05-29 18:00:00   True           1.0
#  6  1220760  a  16.0  2016-05-29 18:40:00  2016-05-29 18:24:00   True           2.0
#  9  1220760  a  19.0  2016-05-29 18:59:00  2016-05-29 18:40:00  False           2.0
# 11  1220760  a  11.0  2016-05-29 19:10:00  2016-05-29 18:59:00  False           2.0
#  8  1220760  b  19.0  2016-05-29 18:59:00  2016-05-29 18:40:00   True           1.0
# 10  1220775  a   3.0  2016-05-29 18:06:00  2016-05-29 18:03:00   True           1.0
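The reason cumsum produces usable group numbers is that True counts as 1 and False as 0. A small sketch on a made-up flag column for a single person (the values are illustrative only) shows the effect; depending on the pandas version, the numbers may print as integers rather than the floats shown in the table above.

```python
import pandas as pd

# Hypothetical flags for one person: True marks the first row of each run.
keep = pd.Series([True, False, False, True, True, False])

# The running total goes up by one exactly where a new run starts
# and stays flat for the rows that continue the current run.
group_number = keep.cumsum()

print(group_number.tolist())
```

Rows that share a group number belong to the same run of abutting intervals, which is what makes the later groupby possible.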
Now the desired result can be found by grouping by ['A', 'B', 'group_number'] and finding the minimum Stime and maximum Etime for each group:
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
                         Stime                Etime
A       B group_number
1220627 a 1.0            2016-05-29 18:00:00  2016-05-29 18:10:00
          2.0            2016-05-29 18:15:00  2016-05-29 18:27:00
1220683 a 1.0            2016-05-29 18:36:00  2016-05-29 18:39:00
        b 1.0            2016-05-29 18:36:00  2016-05-29 18:39:00
1220732 a 1.0            2016-05-29 18:00:00  2016-05-29 18:59:00
1220760 A 1.0            2016-05-29 18:24:00  2016-05-29 18:59:00
        a 1.0            2016-05-29 18:00:00  2016-05-29 18:10:00
          2.0            2016-05-29 18:24:00  2016-05-29 19:10:00
        b 1.0            2016-05-29 18:40:00  2016-05-29 18:59:00
1220775 a 1.0            2016-05-29 18:03:00  2016-05-29 18:06:00
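The aggregation above keeps only the interval endpoints. Since the question describes the new C as the sum of the merged rows (16+19), one possible variation is to sum C in the same agg call. This is a sketch on a few hand-written rows that are assumed to be already labelled with group_number; it is a variation, not a step from the original answer.

```python
import pandas as pd

# Hypothetical rows that already carry a group_number (half-open Etime).
labelled = pd.DataFrame({
    'A': [1220760, 1220760, 1220760],
    'B': ['a', 'a', 'b'],
    'group_number': [1, 1, 1],
    'C': [16.0, 19.0, 19.0],
    'Stime': pd.to_datetime(['2016-05-29 18:24:00',
                             '2016-05-29 18:40:00',
                             '2016-05-29 18:40:00']),
    'Etime': pd.to_datetime(['2016-05-29 18:40:00',
                             '2016-05-29 18:59:00',
                             '2016-05-29 18:59:00']),
})

# One output row per run: total minutes, earliest start, latest end.
merged = (
    labelled.groupby(['A', 'B', 'group_number'], as_index=False)
            .agg({'C': 'sum', 'Stime': 'min', 'Etime': 'max'})
)

print(merged)
```

Summing C and recomputing it from Etime - Stime agree as long as the rows in a run abut exactly, which is the condition used to build the runs in the first place.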
Putting it all together,
import pandas as pd

df = pd.DataFrame(
    {'A': [1220627, 1220627, 1220683, 1220683, 1220732, 1220760, 1220760,
           1220760, 1220760, 1220760, 1220775, 1220760, 1220760],
     'B': ['a', 'a', 'b', 'a', 'a', 'A', 'a', 'A', 'b', 'a', 'a', 'a', 'a'],
     'C': [10.0, 12.0, 3.0, 3.0, 59.0, 16.0, 16.0, 19.0, 19.0, 19.0, 3.0, 11.0, 0],
     'Stime': ['18:00:00', '18:15:00', '18:36:00', '18:36:00', '18:00:00',
               '18:24:00', '18:24:00', '18:40:00', '18:40:00', '18:40:00',
               '18:03:00', '18:59:00', '18:00:00'],
     'Etime': ['18:09:59', '18:26:59', '18:38:59', '18:38:59', '18:58:59',
               '18:39:59', '18:39:59', '18:58:59', '18:58:59', '18:58:59',
               '18:05:59', '19:09:59', '18:09:59'],})

# The strings carry no date, so the date part of the parsed values
# (2016-05-29 in the output below) is filled in automatically.
for col in ['Stime', 'Etime']:
    df[col] = pd.to_datetime(df[col])

# Turn the closed intervals [Stime, Etime] into half-open ones [Stime, Etime).
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
# A row starts a new run unless its Stime equals the previous Etime of the same person.
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
result = result.reset_index()
# Recompute C as the length of each merged interval, in minutes.
result['C'] = (result['Etime']-result['Stime']).dt.total_seconds() / 60.0
result = result[['A', 'B', 'C', 'Stime', 'Etime']]
print(result)
yields
         A  B     C                Stime                Etime
0  1220627  a  10.0  2016-05-29 18:00:00  2016-05-29 18:10:00
1  1220627  a  12.0  2016-05-29 18:15:00  2016-05-29 18:27:00
2  1220683  a   3.0  2016-05-29 18:36:00  2016-05-29 18:39:00
3  1220683  b   3.0  2016-05-29 18:36:00  2016-05-29 18:39:00
4  1220732  a  59.0  2016-05-29 18:00:00  2016-05-29 18:59:00
5  1220760  A  35.0  2016-05-29 18:24:00  2016-05-29 18:59:00
6  1220760  a  10.0  2016-05-29 18:00:00  2016-05-29 18:10:00
7  1220760  a  46.0  2016-05-29 18:24:00  2016-05-29 19:10:00
8  1220760  b  19.0  2016-05-29 18:40:00  2016-05-29 18:59:00
9  1220775  a   3.0  2016-05-29 18:03:00  2016-05-29 18:06:00
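If the steps need to be reused, they could be wrapped in a function. The following is one possible packaging, not part of the original answer: the name merge_adjacent, its gap parameter and the helper column end_open are hypothetical. It assumes the input uses closed intervals [Stime, Etime] as in the question, leaves the input frame untouched, and returns closed intervals again.

```python
import pandas as pd


def merge_adjacent(df, gap=pd.Timedelta(seconds=1)):
    """Merge rows of one (A, B) person whose Stime is `gap` after the previous Etime."""
    out = df.copy()                        # do not modify the caller's frame
    out['end_open'] = out['Etime'] + gap   # half-open end of each interval
    out = out.sort_values(['A', 'B', 'Stime'])
    prev_end = out.groupby(['A', 'B'])['end_open'].shift(1)
    out['keep'] = prev_end != out['Stime']
    out['group_number'] = out.groupby(['A', 'B'])['keep'].cumsum()
    result = (out.groupby(['A', 'B', 'group_number'], as_index=False)
                 .agg({'Stime': 'min', 'end_open': 'max'}))
    result['C'] = (result['end_open'] - result['Stime']).dt.total_seconds() / 60.0
    result['Etime'] = result['end_open'] - gap   # back to a closed interval
    return result[['A', 'B', 'C', 'Stime', 'Etime']]


# Made-up example: the first two rows abut, the third does not.
demo = pd.DataFrame({
    'A': [1, 1, 1],
    'B': ['a', 'a', 'a'],
    'Stime': pd.to_datetime(['2016-05-29 18:24:00', '2016-05-29 18:40:00',
                             '2016-05-29 19:30:00']),
    'Etime': pd.to_datetime(['2016-05-29 18:39:59', '2016-05-29 18:58:59',
                             '2016-05-29 19:34:59']),
})
merged_demo = merge_adjacent(demo)
print(merged_demo)
```

Keeping the half-open end in a separate column is a design choice here so that the stored Etime values keep the convention the caller passed in.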
One of the advantages of using half-open intervals of the form [start, end) instead of fully-closed intervals [start, end] is that when two intervals abut, the end of one equals the start of the next.
Another advantage is that the number of minutes in a half-open interval equals end-start. With a fully-closed interval, the formula becomes end-start+1.
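Both points can be checked on two rows from the question. This is only a sketch: the timestamps are the two rows for person A of household 1220760 with the date taken from the output above, and the "+1" for the closed form is one second because the data has one-second resolution.

```python
import pandas as pd

one_second = pd.Timedelta(seconds=1)
one_minute = pd.Timedelta(minutes=1)

# Two consecutive rows as closed intervals [Stime, Etime].
closed_1 = (pd.Timestamp('2016-05-29 18:24:00'), pd.Timestamp('2016-05-29 18:39:59'))
closed_2 = (pd.Timestamp('2016-05-29 18:40:00'), pd.Timestamp('2016-05-29 18:58:59'))

# The same rows as half-open intervals [Stime, Etime).
half_open_1 = (closed_1[0], closed_1[1] + one_second)
half_open_2 = (closed_2[0], closed_2[1] + one_second)

# Abutting half-open intervals share an endpoint; closed ones do not.
abut_half_open = half_open_1[1] == half_open_2[0]
abut_closed = closed_1[1] == closed_2[0]

# Length in minutes: plain end - start for half-open,
# end - start + 1 (one second here) for closed.
minutes_half_open = (half_open_1[1] - half_open_1[0]) / one_minute
minutes_closed = (closed_1[1] - closed_1[0] + one_second) / one_minute

print(abut_half_open, abut_closed, minutes_half_open, minutes_closed)
```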
Python's builtin range and list slicing syntax use half-open intervals for these same reasons. So I would recommend using half-open intervals [Stime, Etime) in your DataFrame too.
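For comparison, here is how the same convention shows up in plain Python. The list and the cut point are arbitrary illustrative values.

```python
numbers = list(range(10))   # 0 .. 9; the stop value 10 is excluded

# Adjacent slices meet at the same index, with nothing lost or duplicated.
first = numbers[0:4]
second = numbers[4:10]
rejoined = first + second

# The number of items is simply stop - start.
count = len(range(3, 8))

print(first, second, count)
```

The end of the first slice equals the start of the second, just as the Etime of one row equals the Stime of the next once Etime is stored as a half-open end.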