Find group of consecutive dates in Pandas DataFrame

Problem description

I am trying to get the chunks of data that have consecutive dates from a Pandas DataFrame. My df looks like this:

      DateAnalyzed           Val
1       2018-03-18      0.470253
2       2018-03-19      0.470253
3       2018-03-20      0.470253
4       2018-09-25      0.467729
5       2018-09-26      0.467729
6       2018-09-27      0.467729

In this df, I want to get the first 3 rows, do some processing and then get the last 3 rows and do processing on that.

I calculated the difference with a lag of 1 using the following code.

df['Delta']=(df['DateAnalyzed'] - df['DateAnalyzed'].shift(1))

But after that, I can't figure out how to get the groups of consecutive rows without iterating.

Solution

It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.

There's also one tricky part that can be fleshed out by example. Notice that df below contains an added row that doesn't have any consecutive dates before or after it.

>>> df
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253
4   2017-01-20  0.485949  # < watch out for this
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

>>> df.dtypes
DateAnalyzed    datetime64[ns]
Val                    float64
dtype: object
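
To follow along, the example frame can be rebuilt roughly as below. This is only a sketch: the original post shows the printed frame, not how it was constructed, so the use of pd.to_datetime and the explicit 1-based index are assumptions made to match that output.

```python
import pandas as pd

# Rebuild the example frame. The index starts at 1 to match the printed output.
df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            [
                "2018-03-18", "2018-03-19", "2018-03-20",
                "2017-01-20",  # the isolated date with no neighbours
                "2018-09-25", "2018-09-26", "2018-09-27",
            ]
        ),
        "Val": [0.470253, 0.470253, 0.470253, 0.485949,
                0.467729, 0.467729, 0.467729],
    },
    index=range(1, 8),
)

print(df)
print(df.dtypes)
```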

The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See end of answer for a solution if you do want to process this date.)

First:

>>> dt = df['DateAnalyzed']
>>> day = pd.Timedelta('1d')
>>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
>>> in_block
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: DateAnalyzed, dtype: bool

Now, in_block tells you which dates are in a "consecutive" block, but it doesn't tell you which group each date belongs to.
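
It may help to look at the two halves of that mask separately. The sketch below uses the same expressions as the answer; the names follows_prev and next_is_adjacent are made up here for illustration.

```python
import pandas as pd

dt = pd.Series(
    pd.to_datetime(
        ["2018-03-18", "2018-03-19", "2018-03-20", "2017-01-20",
         "2018-09-25", "2018-09-26", "2018-09-27"]
    ),
    index=range(1, 8),
    name="DateAnalyzed",
)
day = pd.Timedelta("1d")

# True when the previous row is exactly one day earlier.
follows_prev = dt.diff() == day
# True when the next row is exactly one day away (same expression as the answer).
next_is_adjacent = (dt - dt.shift(-1)).abs() == day

# A row belongs to a block if it touches a neighbour on either side.
in_block = next_is_adjacent | follows_prev

print(pd.DataFrame({
    "follows_prev": follows_prev,
    "next_is_adjacent": next_is_adjacent,
    "in_block": in_block,
}))
```

The first row of each block is only caught by the look-ahead half, and the last row only by the look-back half, which is why both are needed.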

The next step is to derive the groupings themselves:

>>> filt = df.loc[in_block]
>>> breaks = filt['DateAnalyzed'].diff() != day
>>> groups = breaks.cumsum()
>>> groups
1    1
2    1
3    1
5    2
6    2
7    2
Name: DateAnalyzed, dtype: int64
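
To see why the cumulative sum yields group labels, here is a small sketch that lays the intermediate values side by side. It starts from the already filtered dates; the column names in steps are illustrative only.

```python
import pandas as pd

# The filtered dates (index 4, the isolated date, has already been dropped).
dates = pd.Series(
    pd.to_datetime(
        ["2018-03-18", "2018-03-19", "2018-03-20",
         "2018-09-25", "2018-09-26", "2018-09-27"]
    ),
    index=[1, 2, 3, 5, 6, 7],
)
day = pd.Timedelta("1d")

gap = dates.diff()         # NaT for the first row, then the gap to the previous row
is_break = gap != day      # NaT != day is True, so the first row opens group 1
group = is_break.cumsum()  # each True bumps the running label by one

steps = pd.DataFrame(
    {"date": dates, "gap": gap, "is_break": is_break, "group": group}
)
print(steps)
```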

Then you can call filt.groupby(groups) with your operation of choice.

>>> for _, frame in filt.groupby(groups):
...     print(frame, end='\n\n')
... 
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253

  DateAnalyzed       Val
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729
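
Instead of looping, the "operation of choice" can also be an aggregation. The following is a sketch, not part of the original answer: the summary columns (start, end, rows, mean_val) and the label name block are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            ["2018-03-18", "2018-03-19", "2018-03-20", "2017-01-20",
             "2018-09-25", "2018-09-26", "2018-09-27"]
        ),
        "Val": [0.470253, 0.470253, 0.470253, 0.485949,
                0.467729, 0.467729, 0.467729],
    },
    index=range(1, 8),
)

dt = df["DateAnalyzed"]
day = pd.Timedelta("1d")
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
filt = df.loc[in_block]
# Rename the labels so the result index has a readable name.
groups = (filt["DateAnalyzed"].diff() != day).cumsum().rename("block")

# One row per consecutive block: first date, last date, row count, mean of Val.
summary = filt.groupby(groups).agg(
    start=("DateAnalyzed", "min"),
    end=("DateAnalyzed", "max"),
    rows=("Val", "size"),
    mean_val=("Val", "mean"),
)
print(summary)
```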

To incorporate this back into df, assign groups as a new column. Assignment aligns on the index, so the isolated date gets NaN (which is also why the labels display as floats):

>>> df['groups'] = groups
>>> df
  DateAnalyzed       Val  groups
1   2018-03-18  0.470253     1.0
2   2018-03-19  0.470253     1.0
3   2018-03-20  0.470253     1.0
4   2017-01-20  0.485949     NaN
5   2018-09-25  0.467729     2.0
6   2018-09-26  0.467729     2.0
7   2018-09-27  0.467729     2.0
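
If the float labels are a nuisance, one option (an addition here, not something the original answer covers) is to cast to pandas' nullable "Int64" dtype before assigning. A sketch, assuming a pandas version that supports nullable integers:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            ["2018-03-18", "2018-03-19", "2018-03-20", "2017-01-20",
             "2018-09-25", "2018-09-26", "2018-09-27"]
        ),
        "Val": [0.470253, 0.470253, 0.470253, 0.485949,
                0.467729, 0.467729, 0.467729],
    },
    index=range(1, 8),
)

dt = df["DateAnalyzed"]
day = pd.Timedelta("1d")
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
filt = df.loc[in_block]
groups = (filt["DateAnalyzed"].diff() != day).cumsum()

# Nullable integers keep the labels integral; the isolated row becomes <NA>.
df["groups"] = groups.astype("Int64")
print(df)
```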


If you do want to include the "lone" date, things become a bit more straightforward:

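dt = df['DateAnalyzed']
day = pd.Timedelta('1d')
# Any gap other than exactly one day (including the first row's NaT) starts a new group.
breaks = dt.diff() != day
# Running count of breaks = group label; a lone date becomes a group of one.
groups = breaks.cumsum()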
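
Run against the seven-row example, this simpler variant puts the lone date in a group of its own. A self-contained sketch (the label name group and the sizes variable are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            ["2018-03-18", "2018-03-19", "2018-03-20", "2017-01-20",
             "2018-09-25", "2018-09-26", "2018-09-27"]
        ),
        "Val": [0.470253, 0.470253, 0.470253, 0.485949,
                0.467729, 0.467729, 0.467729],
    },
    index=range(1, 8),
)

dt = df["DateAnalyzed"]
day = pd.Timedelta("1d")
breaks = dt.diff() != day
groups = breaks.cumsum().rename("group")

# Every row gets a label, so no filtering step is needed.
sizes = df.groupby(groups).size()
print(groups.tolist())
print(sizes)
```

Because no rows are filtered out, the labels are plain integers and can be assigned back to df without introducing NaN.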
