Pandas - New Row for Each Day in Date Range

Question

I have a Pandas df with one column (Reservation_Dt_Start) representing the start of a date range and another (Reservation_Dt_End) representing the end of a date range.

Rather than each row having a date range, I'd like to expand each row into as many records as there are dates in its date range, with each new row representing one of those dates.

See the two pics below for an example input and the desired output.
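As a hypothetical illustration of the transformation being asked for, the sketch below builds a tiny input frame and the corresponding desired output by hand. The column names come from the question; every value (dates, locations, prices) is invented for this sketch and is not taken from the original example.

```python
import pandas as pd

# Hypothetical input: one row per reservation (all values are made up).
df = pd.DataFrame({
    'Book_Dt': pd.to_datetime(['2017-01-01', '2017-01-02']),
    'Reservation_Dt_Start': pd.to_datetime(['2017-01-10', '2017-01-20']),
    'Reservation_Dt_End': pd.to_datetime(['2017-01-12', '2017-01-21']),
    'Pickup': ['A', 'B'],
    'Dropoff': ['B', 'C'],
    'Price': [100.0, 80.0],
})

# Desired output: one row per day of each reservation, endpoints included,
# so the first reservation yields 3 rows and the second yields 2.
desired = pd.DataFrame({
    'Book_Dt': pd.to_datetime(['2017-01-01'] * 3 + ['2017-01-02'] * 2),
    'Day_Of_Reservation': pd.to_datetime(['2017-01-10', '2017-01-11', '2017-01-12',
                                          '2017-01-20', '2017-01-21']),
    'Pickup': ['A'] * 3 + ['B'] * 2,
    'Dropoff': ['B'] * 3 + ['C'] * 2,
    'Price': [100.0] * 3 + [80.0] * 2,
})
```

The non-date columns are simply repeated for every day that falls inside the reservation.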

The code snippet below works! However, it takes 1 second to run for every 250 rows in the input table. Given that my input table has 120,000,000 rows, this code will take about one week to run.

pd.concat([pd.DataFrame({'Book_Dt': row.Book_Dt,
                         'Day_Of_Reservation': pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End),
                         'Pickup': row.Pickup,
                         'Dropoff': row.Dropoff,
                         'Price': row.Price},
                        columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
           for i, row in df.iterrows()], ignore_index=True)

There has to be a faster way to do this. Any ideas? Thanks!

Recommended answer

pd.concat in a loop with a large dataset gets pretty slow, as it makes a copy of the frame each time and returns a new dataframe. You are attempting to do this 120 million times. I would instead work with this data as a simple list of tuples and convert it to a dataframe at the end.

For example:

Start with an empty list, rows = [] (a name such as rows avoids shadowing the built-in list).

For each row in the dataframe:

  • get the date range with pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End) and store it in a variable dates

  • for each date in dates, append a tuple to the list: rows.append((row.Book_Dt, date, row.Pickup, row.Dropoff, row.Price))

Finally you can convert the list of tuples to a dataframe:

df = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
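The steps above can be put together as a minimal sketch. The function name expand_reservations and the sample values are hypothetical, and the loop uses itertuples rather than the iterrows call from the question, on the assumption that avoiding a Series per row is cheaper; the list-of-tuples idea itself is the one described in the answer.

```python
import pandas as pd


def expand_reservations(df):
    """Return one row per day of each reservation, built from a list of tuples."""
    rows = []
    # itertuples yields lightweight namedtuples whose fields are the column names.
    for row in df.itertuples(index=False):
        # pd.date_range includes both the start and the end date.
        for day in pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End):
            rows.append((row.Book_Dt, day, row.Pickup, row.Dropoff, row.Price))
    # Build the dataframe once, at the end, instead of concatenating many frames.
    return pd.DataFrame(
        rows,
        columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'],
    )


# Hypothetical sample data (values invented for illustration).
sample = pd.DataFrame({
    'Book_Dt': pd.to_datetime(['2017-01-01', '2017-01-02']),
    'Reservation_Dt_Start': pd.to_datetime(['2017-01-10', '2017-01-20']),
    'Reservation_Dt_End': pd.to_datetime(['2017-01-12', '2017-01-21']),
    'Pickup': ['A', 'B'],
    'Dropoff': ['B', 'C'],
    'Price': [100.0, 80.0],
})

expanded = expand_reservations(sample)
```

This is still a Python-level loop, so on an input of the size mentioned in the question it may remain slow; the sketch only shows how to avoid creating one small dataframe per input row.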
