Count rows with consecutive dates within PANDAS groupby

Views: 625 Posted: 2017/3/26 3:54:18 Tags: python pandas dataframe


Question

Let's say my dataframe looks something like this:

import pandas as pd

d = {'item_number':['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID':['998798098','988797387','12398787','998798098','988797387'],
     'date':['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}

df = pd.DataFrame(data=d)

I would like to count the number of times the same item_number and Comp_ID were observed on consecutive days.

I imagine this will look something along the lines of:

g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())

However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.

for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i,'date'],'%Y-%m-%d').day

Apparently position-based indexing only allows integers? How can I solve this problem?

Recommended answer

Why?

To fix your code, you need:

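# Parse the date strings into Timestamps so they can be subtracted
df['date'] = pd.to_datetime(df['date'])
g = df.groupby(['Comp_ID','item_number'])
# Count adjacent rows in each group whose dates are exactly one day apart
g['date'].apply(lambda x: sum(abs((x.shift(-1) - x)) == pd.to_timedelta(1, unit='D')))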

Note the following:

The code above avoids repetition. That is a basic programming principle: Don't Repeat Yourself.
It converts 1 to a timedelta for a proper comparison.
It takes the absolute difference.
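
The last two notes can be seen on a single group. The following is a minimal sketch, assuming a hand-made Series named `dates` (hypothetical, not from the original post) that holds one group's dates in non-chronological order:

```python
import pandas as pd

# Hypothetical dates for one (Comp_ID, item_number) group, deliberately out of order
dates = pd.Series(pd.to_datetime(['2016-11-13', '2016-11-12', '2016-11-20']))

# Difference between each row and the next one; the last row has no successor, so it is NaT
diff = dates.shift(-1) - dates

# The comparison target must be a timedelta, not the plain integer 1
one_day = pd.to_timedelta(1, unit='D')

# Without abs(), the pair (13th followed by 12th) is -1 day and is not counted
signed_count = int((diff == one_day).sum())

# With abs(), the same pair is counted regardless of row order
abs_count = int((diff.abs() == one_day).sum())

print(signed_count, abs_count)
```

Here the signed comparison finds no match while the absolute one finds one, which is why the answer wraps the difference in abs().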

Tip: write a top-level function for your work instead of a lambda, as it affords better readability, brevity, and aesthetics:

def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))

g['date'].apply(differencer, day_dif=1)
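
As for the question's side issue of extracting the day as an int: once a column holds Timestamps, the `.dt` accessor gives the day of month without a row loop. The sketch below, using a hypothetical two-row Series named `dates`, also suggests why comparing day integers is fragile: consecutive dates that cross a month boundary differ by far more than 1 in day-of-month terms, whereas the timedelta is still one day.

```python
import pandas as pd

# Hypothetical pair of consecutive dates that cross a month boundary
dates = pd.Series(pd.to_datetime(['2016-11-30', '2016-12-01']))

# Day of month as integers, computed for the whole Series at once
days = dates.dt.day

# Comparing day-of-month integers: 1 - 30
day_gap = int(days.iloc[1] - days.iloc[0])

# Subtracting the Timestamps themselves gives the real gap
true_gap = dates.iloc[1] - dates.iloc[0]

print(day_gap, true_gap)
```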

Explanation:

It is pretty straightforward. The dates are converted to Timestamp type, then subtracted. The difference is a timedelta, which in turn needs to be compared with a timedelta object, hence the conversion of 1 (or day_dif) to a timedelta. The result of that comparison is a Boolean Series. Booleans are represented by 0 for False and 1 for True, so the sum of a Boolean Series is the total number of True values in the Series.
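
Putting the steps together on the sample data from the question gives a runnable sketch. The names `gaps` and `counts` are introduced here for illustration only; the rest follows the code in the answer.

```python
import pandas as pd

# Sample data from the question
d = {'item_number': ['K208UL', 'AKD098008', 'DF900A', 'K208UL', 'AKD098008'],
     'Comp_ID': ['998798098', '988797387', '12398787', '998798098', '988797387'],
     'date': ['2016-11-12', '2016-11-13', '2016-11-17', '2016-11-13', '2016-11-14']}
df = pd.DataFrame(data=d)

# Convert the date strings to Timestamps so that subtraction yields timedeltas
df['date'] = pd.to_datetime(df['date'])


def differencer(grp, day_dif):
    """Counts adjacent rows in grp whose dates are day_dif day(s) apart"""
    gaps = abs(grp.shift(-1) - grp)
    # The comparison yields a Boolean Series; summing it counts the True values
    return sum(gaps == pd.to_timedelta(day_dif, unit='D'))


counts = df.groupby(['Comp_ID', 'item_number'])['date'].apply(differencer, day_dif=1)
print(counts)
```

On this data, K208UL and AKD098008 each have one pair of consecutive dates, so each gets a count of 1, while DF900A appears only once and gets 0.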
