How can I parse multiple (unknown) date formats in Python?
Question
I have a bunch of Excel documents that I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data.
The good thing is that I know the order is always month/day:
10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010
I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?
Recommended answer
import re
ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''
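# Split on either separator, "-" or "/"
regx = re.compile('[-/]')
for xd in ss.splitlines():
    m, d, y = regx.split(xd)
    # Pad month and day to two digits; expand a two-digit year to 20YY
    print(xd, ' ', '/'.join((m.zfill(2), d.zfill(2),
                             ('20' + y.zfill(2)) if len(y) == 2 else y)))
Result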
10/02/09 10/02/2009
07/22/09 07/22/2009
09-08-2008 09/08/2008
9/9/2008 09/09/2008
11/4/2010 11/04/2010
03-07-2009 03/07/2009
09/01/2010 09/01/2010
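If the values are headed for a database, the same split-and-pad idea can be wrapped in a function that also rejects impossible dates. The following is only a sketch: the name `normalize_date` is hypothetical, and it assumes (as the answer's code does) that every two-digit year belongs to the 2000s.

```python
import re
from datetime import datetime

_SEPARATOR = re.compile(r'[-/]')

def normalize_date(raw):
    """Return a month-first date such as '9/9/08' or '09-08-2008' as MM/DD/YYYY."""
    m, d, y = _SEPARATOR.split(raw.strip())
    if len(y) == 2:
        y = '20' + y  # assumption: two-digit years are in the 2000s
    # Building a datetime makes impossible dates (month 13, day 40) raise ValueError
    return datetime(int(y), int(m), int(d)).strftime('%m/%d/%Y')

for sample in ('10/02/09', '9/9/2008', '03-07-2009'):
    print(sample, '->', normalize_date(sample))
```

The validation step costs some speed compared with pure string handling, so it is a trade-off rather than a replacement for the loop above.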
Edit 1 and Edit 2: taking into account JBernardo's information about '{0:0>2}'.format(day), I added a fourth solution, which appears to be the fastest.
import re
from time import perf_counter  # time.clock() was removed in Python 3.8
from datetime import datetime

iterat = 100

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
    r"""\s*      # optional whitespace
    (\d+)        # Month
    [-/]         # separator
    (\d+)        # Day
    [-/]         # separator
    (?:20)?      # century (optional)
    (\d+)        # years (YY)
    \s*          # optional whitespace""",
    re.VERBOSE)

# 1. Regex substitution followed by a strptime/strftime round trip
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strftime(datetime.strptime(date, "%m/%d/%Y"), "%m/%d/%Y")
               for date in ndates]
print("Tim's method   ", perf_counter() - te, 'seconds')

regx = re.compile('[-/]')

# 2. Tim's regex for matching, zfill() for padding
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print("mixing solution", perf_counter() - te, 'seconds')

# 3. Plain split on the separator, zfill() for padding
te = perf_counter()
for i in range(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2)) if len(y) == 2 else y))
               for m, d, y in ndates]
print("eyquem's method", perf_counter() - te, 'seconds')

# 4. Tim's regex for matching, str.format() for padding
te = perf_counter()
for i in range(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print("Tim + format   ", perf_counter() - te, 'seconds')

# All four methods must produce the same list
print(fdates1 == fdates2 == fdates3 == fdates4)
Result
number of iteration's turns : 100
Tim's method 0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format 0.0153756971906 seconds
True
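The figures above come from hand-written timing loops. The standard library's `timeit` module is another way to take this kind of measurement; the sketch below times only the split-based method, and the function name `split_method` is made up for the illustration. Absolute numbers will differ from those shown above, since they depend on the machine and the Python version.

```python
import re
import timeit

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']
regx = re.compile('[-/]')

def split_method():
    # Same logic as the split-based loop: split, pad, expand two-digit years
    parts = (regx.split(date.strip()) for date in dates)
    return ['/'.join((m.zfill(2), d.zfill(2), ('20' + y) if len(y) == 2 else y))
            for m, d, y in parts]

# Total time in seconds for 100 calls of split_method()
elapsed = timeit.timeit(split_method, number=100)
print('split method', elapsed, 'seconds')
```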
The mixing solution is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates in a string.
That is even more true of the solution that combines Tim's regex with {:0>2} formatting. I can't combine {:0>2} with my own solution, because regx.split(date.strip()) produces a year with either 2 or 4 digits.
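One possible way around the 2-or-4-digit problem is to keep only the last two digits of whatever year the split returns, and then let the format string supply the century. This is a sketch rather than one of the timed solutions: `split_and_format` is a hypothetical name, and it assumes every year falls between 2000 and 2099.

```python
import re

regx = re.compile('[-/]')

def split_and_format(raw):
    m, d, y = regx.split(raw.strip())
    # y[-2:] keeps the last two digits, so '09' and '2009' both become '09'.
    # Assumption: all years are in the 2000s, so the century is hard-coded.
    return '{:0>2}/{:0>2}/20{:0>2}'.format(m, d, y[-2:])

print(split_and_format('9/9/2008'))
print(split_and_format('10/02/09'))
```

In the format spec `{:0>2}`, `0` is the fill character, `>` right-aligns the value, and `2` is the minimum width, which is why it pads '9' to '09' and leaves '11' unchanged.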