Pandas combining rows based on dates
Question
I have a dataframe of customers with records for shipments they received. Unfortunately, these can overlap. I'm trying to reduce rows so that I can see dates of consecutive use. Is there any way to do this besides a brute force iterrows implementation?
Here's a sample and what I'd like to do:
import pandas as pd

df = pd.DataFrame([['A','2011-02-07','2011-02-22',1],['A','2011-02-14','2011-03-10',2],['A','2011-03-07','2011-03-15',3],['A','2011-03-18','2011-03-25',4]], columns=['Cust','startDate','endDate','shipNo'])
df
condensedDf = df.groupby(['Cust']).apply(reductionFunction)
condensedDf
Here reductionFunction should group the first 3 records into one, because in each case the start date of the next record is before the end date of the prior one. In effect, I'm turning multiple overlapping records into a single record.
Thoughts on a good "pythonic" implementation? I could do a nasty while loop within each group, but I'd prefer not to...
Recommended answer
Fundamentally, I think this is a graph connectivity problem: a fast way of solving it will be some manner of graph connectivity algorithm. Pandas doesn't include such tools, but SciPy does. You can use the compressed sparse graph (csgraph) submodule in SciPy to solve your problem like this:
from scipy.sparse.csgraph import connected_components

# convert to datetime, so min() and max() work
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges:
    # graph[i, j] is True when ranges i and j overlap
    start = data.startDate.values
    end = data.endDate.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': 'first'})

df.groupby(['Cust']).apply(reductionFunction).reset_index('Cust')
If you want to do something different with shipNo from here, it should be pretty straightforward.
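As one possible variation, the sketch below keeps every shipment number in a merged range instead of only the first. The function name reduce_with_ship_list and the output column shipNos are hypothetical names chosen for illustration, and the explicit loop over groups is just one way to reattach the Cust column; it assumes a pandas version that supports named aggregation.

```python
import pandas as pd
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame(
    [['A', '2011-02-07', '2011-02-22', 1],
     ['A', '2011-02-14', '2011-03-10', 2],
     ['A', '2011-03-07', '2011-03-15', 3],
     ['A', '2011-03-18', '2011-03-25', 4]],
    columns=['Cust', 'startDate', 'endDate', 'shipNo'])
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])

def reduce_with_ship_list(data):
    # Hypothetical variant of reductionFunction: same overlap graph,
    # but shipNo values are collected into a list per merged range.
    start = data['startDate'].values
    end = data['endDate'].values
    graph = (start <= end[:, None]) & (end >= start[:, None])
    _, labels = connected_components(graph, directed=False)
    return data.groupby(labels).agg(
        startDate=('startDate', 'min'),
        endDate=('endDate', 'max'),
        shipNos=('shipNo', list),
    )

# Reduce each customer separately, then stitch the pieces back together.
parts = [reduce_with_ship_list(group).assign(Cust=cust)
         for cust, group in df.groupby('Cust')]
condensed = pd.concat(parts, ignore_index=True)
print(condensed)
```

Any other aggregation (for example 'count' or 'max') could be swapped in for list in the same place.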
Note that the connected_components() function above is not brute force, but uses a fast algorithm to find the connections.
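To see what connected_components() returns, here is a minimal sketch that runs it on the overlap matrix for the four sample date ranges alone, using plain NumPy arrays. The variable names overlap and labels are illustrative only. Note that the first and third ranges do not overlap directly, yet they land in the same component because both overlap the second one.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# The four sample date ranges from the question, as NumPy datetimes.
start = np.array(['2011-02-07', '2011-02-14', '2011-03-07', '2011-03-18'],
                 dtype='datetime64[D]')
end = np.array(['2011-02-22', '2011-03-10', '2011-03-15', '2011-03-25'],
               dtype='datetime64[D]')

# overlap[i, j] is True when range i and range j intersect.
overlap = (start <= end[:, None]) & (end >= start[:, None])

# Each label identifies one chain of mutually reachable (overlapping) ranges.
n_components, labels = connected_components(overlap, directed=False)
print(n_components, labels.tolist())  # 2 [0, 0, 0, 1]
```

The overlap matrix itself is dense and has one entry per pair of rows in a group, so this approach may use a lot of memory for very large groups.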