How can I clean date ranges in multiple formats using python pandas?
Question
I have a dataframe that contains some dates in mixed format as follows:
import pandas as pd
dates = ['Dec-03',
'03/11/2003 - 05/04/2004',
'Apr-04',
'2004 - 2005',
'01/02/2005 - 31/03/2005']
df = pd.DataFrame(dates, columns = ["date_range"])
The dates can come in three formats, as shown in the example above: two years; a single month; two dates together. I would like to find an efficient and pythonic way to create start date and end date columns in the dataframe and get the following result:
date_range start_dates end_dates
0 Dec-03 01/12/2003 31/12/2003
1 03/11/2003 - 05/04/2004 03/11/2003 05/04/2004
2 Apr-04 01/04/2004 30/04/2004
3 2004 - 2005 01/01/2004 31/12/2005
4 01/02/2005 - 31/03/2005 01/02/2005 31/03/2005
I have experimented with solutions involving df.iterrows and some if statements, but I was wondering if there is a more efficient method to solve this problem. The full dataset contains millions of rows so something that uses a vectorised function or similar would work well.
Recommended answer
I don't think there's a way to do this in one vectorized operation. What you can do, however, is slice the dataframe into several chunks, one per date range format. For each of these slices, you can calculate the start and end dates in a vectorized manner. Since the number of date formats is much smaller than the number of records, it should be pretty fast.
Here is an implementation:
from pandas.tseries.offsets import MonthEnd, YearEnd
# Start with empty datetime columns, then fill them one format at a time
df["start_time"] = pd.NaT
df["end_time"] = pd.NaT
# Single month, e.g. "Dec-03": first day to last day of that month
mask = df.date_range.str.match(r"\w{3}-\d{2}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"], format="%b-%y")
df.loc[mask, "end_time"] = df.loc[mask, "start_time"] + MonthEnd(1)
# Two years, e.g. "2004 - 2005": 1 January of the first year to 31 December of the second
mask = df.date_range.str.match(r"\d{4}\s*-\s*\d{4}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[0].str.strip(),
                                            format="%Y")
df.loc[mask, "end_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[1].str.strip(),
                                          format="%Y") + YearEnd(1)
# Two full dates, e.g. "03/11/2003 - 05/04/2004": parse each side as day/month/year
mask = df.date_range.str.match(r"\d{2}/\d{2}/\d{4} - \d{2}/\d{2}/\d{4}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[0].str.strip(),
                                            format="%d/%m/%Y")
df.loc[mask, "end_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[1].str.strip(),
                                          format="%d/%m/%Y")
The result is:
date_range start_time end_time
0 Dec-03 2003-12-01 2003-12-31
1 03/11/2003 - 05/04/2004 2003-11-03 2004-04-05
2 Apr-04 2004-04-01 2004-04-30
3 2004 - 2005 2004-01-01 2005-12-31
4 01/02/2005 - 31/03/2005 2005-02-01 2005-03-31
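The question asked for dd/mm/yyyy strings, and on millions of rows it may also be worth checking that every row matched one of the known formats. The following is only a sketch of one way to package the same mask-per-format idea. The helper name split_date_ranges and the variable unparsed are made up for illustration, and it assumes pandas 1.1 or later (for str.fullmatch), a unique index, and English month abbreviations.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd, YearEnd


def split_date_ranges(ranges: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: parse mixed-format range strings into datetimes."""
    text = ranges.astype(str).str.strip()
    start = pd.Series(pd.NaT, index=ranges.index, dtype="datetime64[ns]")
    end = pd.Series(pd.NaT, index=ranges.index, dtype="datetime64[ns]")

    # Single month such as "Dec-03": first day to last day of the month.
    is_month = text.str.fullmatch(r"[A-Za-z]{3}-\d{2}")
    start.loc[is_month] = pd.to_datetime(text[is_month], format="%b-%y")
    end.loc[is_month] = start[is_month] + MonthEnd(1)

    # Two years such as "2004 - 2005": 1 Jan of the first to 31 Dec of the second.
    years = text.str.extract(r"^(\d{4})\s*-\s*(\d{4})$")
    is_years = years[0].notna()
    start.loc[is_years] = pd.to_datetime(years.loc[is_years, 0], format="%Y")
    end.loc[is_years] = pd.to_datetime(years.loc[is_years, 1], format="%Y") + YearEnd(1)

    # Two full dates such as "03/11/2003 - 05/04/2004", read as day/month/year.
    days = text.str.extract(r"^(\d{2}/\d{2}/\d{4})\s*-\s*(\d{2}/\d{2}/\d{4})$")
    is_days = days[0].notna()
    start.loc[is_days] = pd.to_datetime(days.loc[is_days, 0], format="%d/%m/%Y")
    end.loc[is_days] = pd.to_datetime(days.loc[is_days, 1], format="%d/%m/%Y")

    return pd.DataFrame({"start_time": start, "end_time": end})


dates = ['Dec-03',
         '03/11/2003 - 05/04/2004',
         'Apr-04',
         '2004 - 2005',
         '01/02/2005 - 31/03/2005']
df = pd.DataFrame(dates, columns=["date_range"])

parsed = split_date_ranges(df["date_range"])

# Rows that matched none of the three formats keep NaT and can be inspected.
unparsed = df.loc[parsed["start_time"].isna(), "date_range"]

# Render the datetimes as dd/mm/yyyy strings, as in the question.
df["start_dates"] = parsed["start_time"].dt.strftime("%d/%m/%Y")
df["end_dates"] = parsed["end_time"].dt.strftime("%d/%m/%Y")
```

Keeping the datetime columns and formatting only at the end means the values can still be sorted or compared as dates. Any row left in unparsed points to a format that needs its own mask.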