pandas to_datetime parsing wrong year

Question

I'm running into something that is almost certainly a silly mistake on my part, but I can't figure out what's going on.

Essentially, I have a series of dates stored as strings in the format "%d-%b-%y", such as 26-Sep-05. When I convert them to datetime, the year is sometimes correct and sometimes not.

For example:

import pandas as pd

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']

pd.to_datetime(dates, format="%d-%b-%y")
# DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
#                '2061-01-09', '2055-02-08'],
#               dtype='datetime64[ns]', freq=None)

The last two entries, which come back with the years 2061 and 2055, are wrong. But the 15-Jun-70 entry is parsed correctly. What's going on here?

Recommended answer

This appears to be the behavior of Python's datetime library. I ran a quick test and found that the cutoff falls between 68 and 69:

import datetime

# Two-digit year 68 is placed in the 2000s
datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
# datetime.date(2068, 12, 31)

# Two-digit year 69 is placed in the 1900s
datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
# datetime.date(1969, 1, 1)

Two-digit year ambiguity

So any %y year below 69 is assigned to the 2000s, and 69 and above is assigned to the 1900s. This matches the POSIX convention for two-digit years, which Python's strptime follows.

The two digits of %y can only range from 00 to 99, which becomes ambiguous as soon as the data spans more than one century.

If there is no overlap, you can process the data manually and assign the century (disambiguate)

I suggest you process your data manually and specify the century. For example, you can decide that any two-digit year between 17 and 68 in your data belongs to 1917 - 1968 (instead of 2017 - 2068).
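One way to sketch that manual rule is below. This is a minimal illustration rather than a definitive fix: the function name `parse_with_pivot` and its `latest_year` parameter are hypothetical, and it assumes you know the latest year that can legitimately appear in your data, so anything parsed later than that must belong to the previous century. It also assumes an English locale, since %b is locale-dependent.

```python
from datetime import date, datetime


def parse_with_pivot(text, latest_year, fmt="%d-%b-%y"):
    """Parse a date with a two-digit year, then move it back 100 years
    if the default %y rule placed it after latest_year.

    latest_year is an assumption supplied by the caller, e.g. the year
    the data was collected.
    """
    parsed = datetime.strptime(text, fmt).date()
    if parsed.year > latest_year:
        # Raises ValueError for 29 Feb 2000 -> 1900, which is the right
        # outcome because 29 Feb 1900 never existed.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


dates = ['26-Sep-05', '15-Jun-70', '9-Jan-61', '8-Feb-55']
fixed = [parse_with_pivot(d, latest_year=2016) for d in dates]
print(fixed)
```

With `latest_year=2016`, the entries with years 61 and 55 land in 1961 and 1955, while 05 stays in 2005. On a pandas Series of strings, the same function could be applied element by element with `Series.map`.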

If there is overlap, the missing century cannot be recovered unless, for example, your data is ordered and gives you a reference point

If you have overlap, e.g. you have data from both 2016 and 1916 and both were logged as '16', the values are ambiguous and there isn't enough information to parse them. The exception is data that is ordered by date, in which case you can use a heuristic that switches the century as you parse.
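A rough sketch of such a heuristic follows. It rests on several assumptions that are not in the original answer: the records are sorted in ascending date order, the century of the first record is known, and no two consecutive records are 100 or more years apart. The function name `parse_ordered` and the `first_century` parameter are hypothetical.

```python
from datetime import date, datetime


def parse_ordered(texts, first_century=1900, fmt="%d-%b-%y"):
    """Parse ascending-ordered dates with two-digit years.

    Whenever a date would fall before the previous one, assume the
    data has rolled over into the next century.
    """
    century = first_century
    previous = None
    result = []
    for text in texts:
        parsed = datetime.strptime(text, fmt)
        yy = parsed.year % 100  # keep only the two digits that were logged
        candidate = date(century + yy, parsed.month, parsed.day)
        if previous is not None and candidate < previous:
            # Going backwards in ordered data signals a century change.
            century += 100
            candidate = date(century + yy, parsed.month, parsed.day)
        previous = candidate
        result.append(candidate)
    return result


logged = ['3-May-16', '5-Dec-94', '9-Jan-05', '3-May-16']
print(parse_ordered(logged, first_century=1900))
```

Here the first '16' is read as 1916 and the last one as 2016, because the drop from 94 to 05 is treated as the switch from the 1900s to the 2000s. If the data is not strictly ordered, this approach will assign centuries incorrectly.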
