pd.read_csv does not correctly parse the day/month fields when parse_dates = ['column name'] is set

Problem description

I ran into this bug while trying to parse a few dates through the parse_dates argument of pandas.read_csv(). In the following code snippet, I'm trying to parse dates in dd/mm/yy format, and the conversion comes out wrong: in some cases the day field is treated as the month and vice versa.

To keep it simple: in some cases dd/mm/yy is read as mm/dd/yy, so the resulting date has day and month swapped.

Case 1:

  04/10/96 is parsed as 1996-04-10, which is wrong.

Case 2:

  15/07/97 is parsed as 1997-07-15, which is correct.

Case 3:

  10/12/97 is parsed as 1997-10-12, which is wrong.
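
To see why only some rows come out wrong, it can help to check each string against both orderings. The sketch below uses only Python's standard datetime module rather than pandas' own parser, so it illustrates the ambiguity and not the exact pandas code path; the helper name try_parse is made up for this example.

```python
from datetime import datetime

def try_parse(text, fmt):
    """Return the parsed date, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

samples = ['04/10/96', '15/07/97', '10/12/97']
results = {}
for s in samples:
    month_first = try_parse(s, '%m/%d/%y')  # mm/dd/yy reading
    day_first = try_parse(s, '%d/%m/%y')    # dd/mm/yy reading
    results[s] = (month_first, day_first)
    print(s, '| month first:', month_first, '| day first:', day_first)
```

Strings such as 04/10/96 are valid under both readings, so a parser that prefers month-first picks the wrong one without complaint. 15/07/97 has no valid month-first reading (there is no month 15), which is why it is the only case that happens to come out right.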

Code sample

import pandas as pd

# Read the raw CSV; start_date stays a plain string (object) column
df = pd.read_csv('date_time.csv')
print('Data in csv:')
print(df)
print(df['start_date'].dtypes)

print('----------------------------------------------')

# Let read_csv parse start_date; dayfirst defaults to False here
df = pd.read_csv('date_time.csv', parse_dates=['start_date'])
print('Data after parsing:')
print(df)
print(df['start_date'].dtypes)

Current output

----------------------
Data in csv:
----------------------
  start_date
0   04/10/96
1   15/07/97
2   10/12/97
3   06/03/99
4     //1994
5   /02/1967
object
----------------------
Data after parsing:
----------------------
   start_date
0 1996-04-10
1 1997-07-15
2 1997-10-12
3 1999-06-03
4 1994-01-01
5 1967-02-01
datetime64[ns]

Expected output

----------------------
Data in csv:
----------------------
   start_date
0   04/10/96
1   15/07/97
2   10/12/97
3   06/03/99
4     //1994
5   /02/1967
object
----------------------
Data after parsing:
----------------------
   start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
datetime64[ns]

Additional comments:

I could use date_parser or pandas.to_datetime() to specify the proper format for the dates. But in my case, I have a few date fields like ['//1997', '/02/1967'] that I need to convert to ['01/01/1997', '01/02/1967']. parse_dates converts those kinds of date fields to the expected format without making me write extra lines of code.
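
For reference, the "extra lines of code" for the partial dates might look roughly like the sketch below. It is only an illustration in plain Python: the helper name fill_missing_parts and the choice of '01' as the filler are assumptions, and it expects every value to contain exactly two slashes.

```python
def fill_missing_parts(text, default='01'):
    """Fill an empty day or month in a d/m/y string with a default value."""
    day, month, year = text.split('/')
    return '/'.join([day or default, month or default, year])

raw = ['//1997', '/02/1967', '04/10/96']
filled = [fill_missing_parts(value) for value in raw]
print(filled)
```

Complete dates pass through unchanged, so the helper could be applied to the whole column before parsing with an explicit format.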

Is there any solution for this?

Bug link on GitHub: https://github.com/pydata/pandas/issues/13063

Recommended answer

In pandas 0.18.0 you can add the parameter dayfirst=True, and then it works:

import pandas as pd
import io

temp=u"""start_date
04/10/96
15/07/97
10/12/97
06/03/99
//1994
/02/1967
"""
# After testing, replace io.StringIO(temp) with the CSV filename
df = pd.read_csv(io.StringIO(temp), parse_dates=['start_date'], dayfirst=True)
print(df)
  start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
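
The same dayfirst flag is also accepted by pd.to_datetime, which may be handy when the column has already been loaded as strings. The sketch below is a minimal illustration with only the fully specified dates from the question; the variable names are arbitrary, and the partial dates are left out here because how they are handled may depend on the pandas version.

```python
import pandas as pd

# Fully specified dd/mm/yy strings taken from the question's sample data
raw = pd.Series(['04/10/96', '15/07/97', '10/12/97', '06/03/99'],
                name='start_date')

# dayfirst=True asks the parser to read ambiguous values as day/month/year
parsed = pd.to_datetime(raw, dayfirst=True)
print(parsed)
```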

Another solution:

You can parse the column with to_datetime several times, each time with a different format and errors='coerce', and then merge the results with combine_first:

# Full dd/mm/yy dates; anything that does not match becomes NaT
date1 = pd.to_datetime(df['start_date'], format='%d/%m/%y', errors='coerce')
print(date1)
0   1996-10-04
1   1997-07-15
2   1997-12-10
3   1999-03-06
4          NaT
5          NaT
Name: start_date, dtype: datetime64[ns]

# Dates with a missing day, such as /02/1967
date2 = pd.to_datetime(df['start_date'], format='/%m/%Y', errors='coerce')
print(date2)
0          NaT
1          NaT
2          NaT
3          NaT
4          NaT
5   1967-02-01
Name: start_date, dtype: datetime64[ns]

# Dates with only a year, such as //1994
date3 = pd.to_datetime(df['start_date'], format='//%Y', errors='coerce')
print(date3)
0          NaT
1          NaT
2          NaT
3          NaT
4   1994-01-01
5          NaT
Name: start_date, dtype: datetime64[ns]

# Fill the NaT gaps in date1 from date2, then from date3
print(date1.combine_first(date2).combine_first(date3))
0   1996-10-04
1   1997-07-15
2   1997-12-10
3   1999-03-06
4   1994-01-01
5   1967-02-01
Name: start_date, dtype: datetime64[ns]
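
If more formats turn up, the three separate calls could be folded into a small loop. This is only a sketch of the same to_datetime plus combine_first idea; the function name parse_with_fallbacks is hypothetical and not part of pandas.

```python
import pandas as pd

def parse_with_fallbacks(series, formats):
    """Parse with each format in turn; earlier formats win where they match."""
    result = pd.to_datetime(series, format=formats[0], errors='coerce')
    for fmt in formats[1:]:
        # Rows still NaT are filled from the next format's result
        fallback = pd.to_datetime(series, format=fmt, errors='coerce')
        result = result.combine_first(fallback)
    return result

raw = pd.Series(['04/10/96', '15/07/97', '10/12/97', '06/03/99',
                 '//1994', '/02/1967'], name='start_date')
parsed = parse_with_fallbacks(raw, ['%d/%m/%y', '/%m/%Y', '//%Y'])
print(parsed)
```

The order of the formats matters only when a value could match more than one of them, which is not the case for the sample data above.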
