如何在python中删除组中的某些行 [英] How to remove some rows in a group by in python
问题描述
我有一个数据框,我想基于列进行groupby()
,然后根据日期列对每个组中的值进行排序.然后,我要从每个记录中删除其column_condition == 'B'
值的记录,直到到达其column_condition == 'A'
的行.例如,假设下表是组之一
I'm having a dataframe and I'd like to do a groupby()
based a column and then sort the values within each group based on a date column. Then, from a each I'd like to remove records whose value for column_condition == 'B'
until I reach to a row whose column_condition == 'A'
. For example, Assume the table below is one of the groups
ID, DATE, column_condition
--------------------------
1, jan 2017, B
1, Feb 2017, B
1, Mar 2017, B
1, Aug 2017, A
1, Sept 2017, B
因此,我想删除前三行,而只剩下最后两行.我该怎么办?
So, I'd like to remove the first three rows and leave this group with only the last two rows. How can I do that?
推荐答案
我想我终于理解了您的问题:您希望按'ID'
groupby
a dataframe
,按日期排序,并保留行之后的行. condition
列中'A'
的第一次出现.我提出了以下一种班轮解决方案:
I think I finally understand your question: you wish to groupby
a dataframe
by 'ID'
, sort by date, and keep the rows after the first ocurrence of 'A'
in your condition
column. I've come up with the following one liner solution:
设置虚拟数据
import pandas as pd
import datetime as dt
d = {
'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
'DATE': [ # Dates already sorted, but it would work anyways
dt.date(2018, 7, 19), dt.date(2018, 8, 18),
dt.date(2018, 9, 17), dt.date(2018, 10, 17),
dt.date(2018, 11, 16), dt.date(2018, 7, 19),
dt.date(2018, 8, 18), dt.date(2018, 9, 17),
dt.date(2018, 10, 17), dt.date(2018, 11, 16)
],
'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']
}
# 'DATE' but with list comprehension:
# [dt.date.today() + dt.timedelta(days=30*x) for y in range(0, 2) for x in range(0, 5)]
df = pd.DataFrame(d)
翻译
>>> (df.sort_values(by='DATE') # we should call pd.to_datetime() first if...
... .groupby('ID') # 'DATE' is not datetime already
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))
ID DATE condition
ID
1 3 1 2018-10-17 A
4 1 2018-11-16 B
2 8 2 2018-10-17 A
9 2 2018-11-16 B
如果您需要这样的话,您也可以致电reset_index(drop=True)
:
You can also call reset_index(drop=True)
, if you need something like this:
ID DATE condition
0 1 2018-10-17 A
1 1 2018-11-16 B
2 2 2018-10-17 A
3 2 2018-11-16 B
(x['condition'].values == 'A')
返回一个bool
np.array
,然后调用argmax()
给我们提供索引,在该位置第一次出现True
的情况发生(在本例中为condition == 'A'
).使用该索引,我们用slice
子组化每个组.
(x['condition'].values == 'A')
returns a bool
np.array
, and calling argmax()
gives us then index where the first ocurrence of True
happens (where condition == 'A'
in this case). Using that index, we're subsetting each of the groups with a slice
.
添加了用于处理仅包含不良条件的组的过滤器.
Added filter for dealing with groups that only contain the undesired condition.
d = {
'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
'DATE': [ # Dates already sorted, but it would work anyways
dt.date(2018, 7, 19), dt.date(2018, 8, 18),
dt.date(2018, 9, 17), dt.date(2018, 10, 17),
dt.date(2018, 11, 16), dt.date(2018, 7, 19),
dt.date(2018, 8, 18), dt.date(2018, 9, 17),
dt.date(2018, 10, 17), dt.date(2018, 11, 16)
], # ID 1 only contains 'B'
'condition': ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B']
}
df = pd.DataFrame(d)
翻译
>>> df
ID DATE condition
0 1 2018-07-19 B
1 1 2018-08-18 B
2 1 2018-09-17 B
3 1 2018-10-17 B
4 1 2018-11-16 B
5 2 2018-07-19 B
6 2 2018-08-18 B
7 2 2018-09-17 B
8 2 2018-10-17 A
9 2 2018-11-16 B
>>> (df.sort_values(by='DATE')
... .groupby('ID')
... .filter(lambda x: (x['condition'] == 'A').any())
... .groupby('ID')
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))
ID DATE condition
ID
2 8 2 2018-10-17 A
9 2 2018-11-16 B
这篇关于如何在python中删除组中的某些行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!