How can I clean date ranges in multiple formats using Python pandas?


Question

I have a dataframe that contains some dates in mixed formats, as follows:

import pandas as pd

dates = ['Dec-03',
         '03/11/2003 - 05/04/2004',
         'Apr-04',
         '2004 - 2005',
         '01/02/2005 - 31/03/2005']

df = pd.DataFrame(dates, columns = ["date_range"])

The dates can come in three formats, as shown in the example above: two years; a single month; two dates together. I am hoping to find an efficient and pythonic way to create start date and end date columns in the dataframe, with the following result:

    date_range                         start_dates  end_dates
0   Dec-03                             01/12/2003   31/12/2003
1   03/11/2003 - 05/04/2004            03/11/2003   05/04/2004
2   Apr-04                             01/04/2004   30/04/2004
3   2004 - 2005                        01/01/2004   31/12/2005
4   01/02/2005 - 31/03/2005            01/02/2005   31/03/2005

I have experimented with solutions involving df.iterrows and some if statements, but I was wondering if there is a more efficient method to solve this problem. The full dataset contains millions of rows, so something that uses a vectorised function or similar would work well.

Recommended answer

I don't think there's a way to do this in one vectorized operation. What you can do, however, is slice the dataframe into several chunks, one per date range format. For each of these slices, you can calculate the start and end dates in a vectorized manner. Since the number of date formats is much smaller than the number of records, it should be pretty fast.

Here is an implementation:

import pandas as pd
from pandas.tseries.offsets import MonthEnd, YearEnd

# Start with empty datetime columns; each format below fills in its own rows.
df["start_time"] = pd.NaT
df["end_time"] = pd.NaT

# Single month, e.g. "Dec-03": first day of the month to its last day.
mask = df.date_range.str.match(r"\w{3}-\d{2}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"], format="%b-%y")
df.loc[mask, "end_time"] = df.loc[mask, "start_time"] + MonthEnd(1)

# Two years, e.g. "2004 - 2005": 1 January of the first year to 31 December of the second.
mask = df.date_range.str.match(r"\d{4}\s*-\s*\d{4}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[0].str.strip(),
                                            format="%Y")
df.loc[mask, "end_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[1].str.strip(),
                                          format="%Y") + YearEnd(1)

# Two full dates, e.g. "03/11/2003 - 05/04/2004": parse each side as day/month/year.
mask = df.date_range.str.match(r"\d{2}/\d{2}/\d{4} - \d{2}/\d{2}/\d{4}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[0].str.strip(),
                                            format="%d/%m/%Y")
df.loc[mask, "end_time"] = pd.to_datetime(df.loc[mask, "date_range"].str.split("-", expand=True)[1].str.strip(),
                                          format="%d/%m/%Y")

The result is:

                date_range start_time   end_time
0                   Dec-03 2003-12-01 2003-12-31
1  03/11/2003 - 05/04/2004 2003-11-03 2004-04-05
2                   Apr-04 2004-04-01 2004-04-30
3              2004 - 2005 2004-01-01 2005-12-31
4  01/02/2005 - 31/03/2005 2005-02-01 2005-03-31
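
The same per-format idea could also be wrapped in a reusable function. The following is only a sketch under a few assumptions, not part of the original answer: the function name parse_date_ranges is hypothetical, the regular expressions are anchored to the whole string (a stricter choice than the patterns above), month abbreviations are assumed to be English (%b depends on the locale), and rows that match none of the three formats are left as NaT. The last two lines format the results as dd/mm/YYYY strings, which is the layout shown in the question.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd, YearEnd


def parse_date_ranges(ranges: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: start/end timestamps for date-range strings."""
    text = ranges.str.strip()
    start = pd.Series(pd.NaT, index=ranges.index, dtype="datetime64[ns]")
    end = pd.Series(pd.NaT, index=ranges.index, dtype="datetime64[ns]")

    # Format 1: a single month such as "Dec-03".
    is_month = text.str.fullmatch(r"[A-Za-z]{3}-\d{2}", na=False)
    month_start = pd.to_datetime(text[is_month], format="%b-%y")
    start.loc[is_month] = month_start
    # MonthEnd(0) rolls forward to the last day of the same month.
    end.loc[is_month] = month_start + MonthEnd(0)

    # Format 2: two years such as "2004 - 2005".
    years = text.str.extract(r"^(\d{4})\s*-\s*(\d{4})$")
    is_year = years[0].notna()
    start.loc[is_year] = pd.to_datetime(years.loc[is_year, 0], format="%Y")
    # YearEnd(0) rolls 1 January forward to 31 December of the same year.
    end.loc[is_year] = pd.to_datetime(years.loc[is_year, 1], format="%Y") + YearEnd(0)

    # Format 3: two full dates such as "03/11/2003 - 05/04/2004".
    days = text.str.extract(r"^(\d{2}/\d{2}/\d{4})\s*-\s*(\d{2}/\d{2}/\d{4})$")
    is_day = days[0].notna()
    start.loc[is_day] = pd.to_datetime(days.loc[is_day, 0], format="%d/%m/%Y")
    end.loc[is_day] = pd.to_datetime(days.loc[is_day, 1], format="%d/%m/%Y")

    return pd.DataFrame({"start_time": start, "end_time": end})


df = pd.DataFrame({"date_range": ["Dec-03",
                                  "03/11/2003 - 05/04/2004",
                                  "Apr-04",
                                  "2004 - 2005",
                                  "01/02/2005 - 31/03/2005"]})

parsed = parse_date_ranges(df["date_range"])
# Format as dd/mm/YYYY strings, matching the layout requested in the question.
df["start_dates"] = parsed["start_time"].dt.strftime("%d/%m/%Y")
df["end_dates"] = parsed["end_time"].dt.strftime("%d/%m/%Y")
```

Because unmatched rows stay NaT, a check such as parsed["start_time"].isna().sum() gives a quick count of strings that fit none of the three formats, which may be worth running on a dataset with millions of rows before trusting the output.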
