pd.read_csv not parsing the day/month fields correctly when parse_dates = ['column name'] is set
Problem description
I ran into this bug while trying to parse a few dates through the parse_dates parameter of pandas.read_csv(). In the following code snippet I'm trying to parse dates in the format dd/mm/yy, and the conversion comes out wrong. In some cases the day field is treated as the month, and vice versa.
To keep it simple: in some cases dd/mm/yy gets converted to yyyy-dd-mm instead of yyyy-mm-dd.
Case 1:
04/10/96 is parsed as 1996-04-10, which is wrong.
Case 2:
15/07/97 is parsed as 1997-07-15, which is correct.
Case 3:
10/12/97 is parsed as 1997-10-12, which is wrong.
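The three cases are consistent with a parser that reads month-first and only falls back to day-first when the month-first reading is impossible (15 is not a valid month). Below is a minimal sketch of that behaviour using only the standard library. It is an illustration, not pandas' actual implementation, and parse_month_first_with_fallback is a hypothetical helper name.

```python
from datetime import datetime


def parse_month_first_with_fallback(text):
    """Hypothetical helper: try mm/dd/yy first, then fall back to dd/mm/yy."""
    for fmt in ("%m/%d/%y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # This format does not fit the string; try the next one.
            continue
    raise ValueError("no format matched: %r" % text)


for raw in ("04/10/96", "15/07/97", "10/12/97"):
    print(raw, "->", parse_month_first_with_fallback(raw).isoformat())
# 04/10/96 -> 1996-04-10   (read as April 10, not 4 October)
# 15/07/97 -> 1997-07-15   (month 15 is invalid, so day-first is used)
# 10/12/97 -> 1997-10-12   (read as October 12, not 10 December)
```

Only the row whose first number is greater than 12 comes out right, which is exactly the pattern in the three cases above.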
Code sample
import pandas as pd

# Read the column as plain strings first
df = pd.read_csv('date_time.csv')
print('Data in csv:')
print(df)
print(df['start_date'].dtypes)
print('----------------------------------------------')
# Read the file again, asking pandas to parse the column as dates
df = pd.read_csv('date_time.csv', parse_dates=['start_date'])
print('Data after parsing:')
print(df)
print(df['start_date'].dtypes)
Current output
----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-04-10
1 1997-07-15
2 1997-10-12
3 1999-06-03
4 1994-01-01
5 1967-02-01
datetime64[ns]
Expected output
----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
datetime64[ns]
More comments:
I could use date_parser or pandas.to_datetime() to specify the proper format for the dates. But in my case I have a few date fields like ['//1997', '/02/1967'] that I need to convert to ['01/01/1997', '01/02/1967']. parse_dates converts those kinds of date fields to the expected format without my having to write extra lines of code.
Is there any solution for this?
Bug link on GitHub: https://github.com/pydata/pandas/issues/13063
Recommended answer
In pandas 0.18.0 you can add the parameter dayfirst=True, and then it works:
import pandas as pd
import io
temp=u"""start_date
04/10/96
15/07/97
10/12/97
06/03/99
//1994
/02/1967
"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=['start_date'], dayfirst=True)
print(df)
start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
Another solution:
You can parse the column with to_datetime several times, each time with a different format parameter and errors='coerce', and then merge the results with combine_first:
date1 = pd.to_datetime(df['start_date'], format='%d/%m/%y', errors='coerce')
print(date1)
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 NaT
5 NaT
Name: start_date, dtype: datetime64[ns]
date2 = pd.to_datetime(df['start_date'], format='/%m/%Y', errors='coerce')
print(date2)
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
5 1967-02-01
Name: start_date, dtype: datetime64[ns]
date3 = pd.to_datetime(df['start_date'], format='//%Y', errors='coerce')
print(date3)
0 NaT
1 NaT
2 NaT
3 NaT
4 1994-01-01
5 NaT
Name: start_date, dtype: datetime64[ns]
print(date1.combine_first(date2).combine_first(date3))
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
Name: start_date, dtype: datetime64[ns]
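If there are more than a handful of formats, the same to_datetime plus combine_first idea can be wrapped in a small loop. The following is a sketch under that assumption; parse_with_fallback_formats is a hypothetical helper name, not part of pandas, and it expects at least one format. Formats listed earlier take precedence, because combine_first only fills the values that are still NaT.

```python
import pandas as pd


def parse_with_fallback_formats(series, formats):
    """Hypothetical helper: parse with each format in turn, earlier ones win."""
    # One parsed Series per format; values that do not match become NaT.
    candidates = [
        pd.to_datetime(series, format=fmt, errors="coerce") for fmt in formats
    ]
    result = candidates[0]
    for candidate in candidates[1:]:
        # Fill only the positions that are still NaT.
        result = result.combine_first(candidate)
    return result


raw = pd.Series(
    ["04/10/96", "15/07/97", "10/12/97", "06/03/99", "//1994", "/02/1967"],
    name="start_date",
)
parsed = parse_with_fallback_formats(raw, ["%d/%m/%y", "/%m/%Y", "//%Y"])
print(parsed)
```

With the sample data from the question this should give the same six dates as the chained combine_first call above.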