Pandas combining rows based on dates


Question

I have a dataframe of customers with records for shipments they received. Unfortunately, these can overlap. I'm trying to reduce rows so that I can see dates of consecutive use. Is there any way to do this besides a brute force iterrows implementation?

Here's a sample and what I'd like to do:

import pandas as pd

df = pd.DataFrame([['A','2011-02-07','2011-02-22',1], ['A','2011-02-14','2011-03-10',2],
                   ['A','2011-03-07','2011-03-15',3], ['A','2011-03-18','2011-03-25',4]],
                  columns=['Cust','startDate','endDate','shipNo'])
df

# reductionFunction is the function I am asking for; it does not exist yet
condensedDf = df.groupby(['Cust']).apply(reductionFunction)
condensedDf

The reductionFunction would group the first three records into one, because in each case the start date of the next record falls before the end date of the prior one. I'm essentially turning multiple overlapping records into a single record.

Thoughts on a good "pythonic" implementation? I could do a nasty while loop within each group, but I'd prefer not to...

Recommended answer

Fundamentally, I think this is a graph connectivity problem: a fast way of solving it will be some manner of graph connectivity algorithm. Pandas doesn't include such tools, but scipy does. You can use the compressed sparse graph (csgraph) submodule in scipy to solve your problem like this:

import pandas as pd
from scipy.sparse.csgraph import connected_components

# convert to datetime, so min() and max() work
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.startDate.values
    end = data.endDate.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': 'first'})

df.groupby(['Cust']).apply(reductionFunction).reset_index('Cust')
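
To make the graph step concrete, here is a minimal sketch that builds the same boolean overlap matrix for the four sample shipments using plain NumPy arrays. The names `labels` and the explicit `directed=False` argument are choices made for this illustration and are not part of the answer's code.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# The four sample date ranges for customer A.
start = np.array(['2011-02-07', '2011-02-14', '2011-03-07', '2011-03-18'],
                 dtype='datetime64[D]')
end = np.array(['2011-02-22', '2011-03-10', '2011-03-15', '2011-03-25'],
               dtype='datetime64[D]')

# graph[i, j] is True when range j starts on or before the end of range i
# and ends on or after the start of range i, i.e. the two ranges overlap.
graph = (start <= end[:, None]) & (end >= start[:, None])

# The overlap relation is symmetric, so treat the graph as undirected.
n_components, labels = connected_components(graph, directed=False)

print(graph.astype(int))
print(n_components, labels)
```

Shipments 1 and 3 do not overlap directly, but both overlap shipment 2, so the first three rows end up in one component and the fourth row forms its own. That transitive chaining is exactly what a pairwise overlap check alone would miss.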

If you want to do something different with shipNo from here, it should be pretty straightforward.
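
As one possible variation, the sketch below keeps every shipment number in a merged range instead of only the first. The function name `reduce_keep_shipments`, the output column `shipNos`, and the index level name `group` are hypothetical names introduced for this example. It also loops over the groups and concatenates the pieces rather than calling `apply`, which is only a stylistic choice here.

```python
import pandas as pd
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame(
    [['A', '2011-02-07', '2011-02-22', 1],
     ['A', '2011-02-14', '2011-03-10', 2],
     ['A', '2011-03-07', '2011-03-15', 3],
     ['A', '2011-03-18', '2011-03-25', 4]],
    columns=['Cust', 'startDate', 'endDate', 'shipNo'])
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])


def reduce_keep_shipments(data):
    """Merge overlapping ranges for one customer, collecting shipNo values."""
    start = data['startDate'].values
    end = data['endDate'].values
    # Same pairwise overlap matrix as in the answer above.
    graph = (start <= end[:, None]) & (end >= start[:, None])
    _, labels = connected_components(graph, directed=False)
    return data.groupby(labels).agg(
        startDate=('startDate', 'min'),
        endDate=('endDate', 'max'),
        shipNos=('shipNo', list),  # keep all shipment numbers in the range
    )


# Apply the reduction per customer and stitch the pieces back together.
pieces = {cust: reduce_keep_shipments(group) for cust, group in df.groupby('Cust')}
merged = pd.concat(pieces, names=['Cust', 'group']).reset_index()
print(merged)
```

Swapping `list` for another aggregation such as `'count'` or `'last'` changes what is reported per merged range without touching the grouping logic.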

Note that the connected_components() function above is not brute force, but uses a fast algorithm to find the connections.
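
One caveat is that the dense overlap matrix grows with the square of the number of rows per customer. For comparison, here is a sketch of a different, commonly used technique that is not part of the answer above: sort by start date and begin a new block whenever a start date is later than the running maximum of all earlier end dates. The names `prev_end`, `new_block`, and `block` are hypothetical, and the sketch assumes the same inclusive overlap rule as the graph version (touching ranges are merged).

```python
import pandas as pd

df = pd.DataFrame(
    [['A', '2011-02-07', '2011-02-22', 1],
     ['A', '2011-02-14', '2011-03-10', 2],
     ['A', '2011-03-07', '2011-03-15', 3],
     ['A', '2011-03-18', '2011-03-25', 4]],
    columns=['Cust', 'startDate', 'endDate', 'shipNo'])
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])

# Sort so that each customer's ranges are processed in start order.
frame = df.sort_values(['Cust', 'startDate'])

# Latest end date seen among the earlier rows of the same customer.
# The first row of each customer gets NaT because nothing precedes it.
prev_end = frame.groupby('Cust')['endDate'].transform(
    lambda s: s.cummax().shift())

# A row opens a new block when it starts after everything before it ended.
# Comparisons against NaT are False, so the first row stays in block 0.
new_block = (frame['startDate'] > prev_end).astype(int)
block = new_block.groupby(frame['Cust']).cumsum().rename('block')

result = (frame.groupby(['Cust', block])
               .agg(startDate=('startDate', 'min'),
                    endDate=('endDate', 'max'),
                    shipNo=('shipNo', 'first'))
               .reset_index())
print(result)
```

On the sample data this produces the same two merged ranges as the graph approach, while only ever holding one extra column per row in memory.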
