How can I parse multiple (unknown) date formats in Python?
I have a bunch of Excel documents I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data:
The good thing is that I know the order is always month/day:
10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010
I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?
import re

ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''

# split on either separator, '-' or '/'
regx = re.compile('[-/]')

for xd in ss.splitlines():
    m, d, y = regx.split(xd)
    # pad month and day to two digits; prefix a two-digit year with '20'
    print(xd, ' ', '/'.join((m.zfill(2), d.zfill(2), '20' + y.zfill(2) if len(y) == 2 else y)))
result
10/02/09 10/02/2009
07/22/09 07/22/2009
09-08-2008 09/08/2008
9/9/2008 09/09/2008
11/4/2010 11/04/2010
03-07-2009 03/07/2009
09/01/2010 09/01/2010
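The same split-and-pad idea can be wrapped in a small reusable function. This is only a sketch: the name `normalize_date` and its `century` parameter are hypothetical, and it assumes, as the question states, that the order is always month, day, year and that two-digit years belong to the 2000s.

```python
import re

# '-' or '/' may separate the three fields
_SEPARATOR = re.compile(r'[-/]')


def normalize_date(text, century='20'):
    """Return a month/day/year string as MM/DD/YYYY.

    Raises ValueError if the text does not split into exactly three parts.
    """
    month, day, year = (part.strip() for part in _SEPARATOR.split(text.strip()))
    if len(year) == 2:
        # assumption: every two-digit year belongs to the given century
        year = century + year
    return '/'.join((month.zfill(2), day.zfill(2), year))


if __name__ == '__main__':
    for raw in ('10/02/09', '09-08-2008', '9/9/2008', ' 03-07-2009'):
        print(repr(raw), '->', normalize_date(raw))
```

A string that does not contain exactly two separators fails the tuple unpacking with a ValueError, which makes malformed rows easy to catch before they reach the database.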
Edit 1
And Edit 2: taking into account the information from JBernardo about '{0:0>2}'.format(day), I added a fourth solution, which appears to be the fastest.
import re
from datetime import datetime
from time import perf_counter  # time.clock() was removed in Python 3.8

iterat = 100

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
    r"""\s*  # optional whitespace
    (\d+)    # Month
    [-/]     # separator
    (\d+)    # Day
    [-/]     # separator
    (?:20)?  # century (optional)
    (\d+)    # years (YY)
    \s*      # optional whitespace""",
    re.VERBOSE)

regx = re.compile('[-/]')

print("number of iteration's turns :", iterat)

# 1. Tim's method: regex substitution, then a strptime/strftime round trip
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strftime(datetime.strptime(date, "%m/%d/%Y"), "%m/%d/%Y")
               for date in ndates]
print("Tim's method   ", perf_counter() - te, 'seconds')

# 2. mixing solution: Tim's regex to extract the fields, zfill to pad them
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print("mixing solution", perf_counter() - te, 'seconds')

# 3. eyquem's method: split on the separator, zfill to pad
te = perf_counter()
for i in range(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2) if len(y) == 2 else y)))
               for m, d, y in ndates]
print("eyquem's method", perf_counter() - te, 'seconds')

# 4. Tim + format: Tim's regex to extract the fields, '{:0>2}' to pad them
te = perf_counter()
for i in range(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print("Tim + format   ", perf_counter() - te, 'seconds')

# all four methods must produce identical output
print(fdates1 == fdates2 == fdates3 == fdates4)
result
number of iteration's turns : 100
Tim's method 0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format 0.0153756971906 seconds
True
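As an alternative way to compare variants like these, the standard library's timeit module can call each one repeatedly and report the total time. The sketch below covers only two of the four approaches; the function names `split_and_zfill` and `match_and_format` are hypothetical, and the absolute numbers will differ from the figures above depending on machine and Python version.

```python
import re
import timeit

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

regx = re.compile(r'[-/]')
reobj = re.compile(r'\s*(\d+)[-/](\d+)[-/](?:20)?(\d+)\s*')


def split_and_zfill():
    """Split on the separator, pad with zfill, prefix two-digit years."""
    parts = (regx.split(date.strip()) for date in dates)
    return ['/'.join((m.zfill(2), d.zfill(2), '20' + y if len(y) == 2 else y))
            for m, d, y in parts]


def match_and_format():
    """Extract the fields with the regex, pad with the '{:0>2}' format spec."""
    return ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups())
            for date in dates]


# both variants must agree before their timings are worth comparing
assert split_and_zfill() == match_and_format()

for func in (split_and_zfill, match_and_format):
    seconds = timeit.timeit(func, number=100)
    print('{:<18}{:.6f} seconds'.format(func.__name__, seconds))
```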
The mixing solution is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates inside a string.
That is even more true of the solution that combines Tim's regex with {:0>2} formatting. I can't combine {:0>2} with my own approach, because regx.split(date.strip()) produces a year with either 2 or 4 digits, so a fixed '20' prefix cannot be used.
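To illustrate the point about detecting dates inside a string, here is a rough sketch that pulls dates out of free text and normalizes them with the {:0>2} format spec. The pattern is a tightened variant of the regex above rather than the one from the answer, the names `DATE_IN_TEXT` and `find_dates` are hypothetical, and it assumes every year falls in the 2000s.

```python
import re

# Month and day are 1-2 digits, the year is YY or 20YY.
# The lookarounds keep the pattern from matching inside longer digit runs.
DATE_IN_TEXT = re.compile(
    r'(?<!\d)(\d{1,2})[-/](\d{1,2})[-/](?:20)?(\d{2})(?!\d)')


def find_dates(text):
    """Return every month/day/year date found in text as MM/DD/YYYY."""
    # the optional '20' is consumed outside the group, so the captured year
    # always has two digits and the '20' prefix can be added unconditionally
    return ['{:0>2}/{:0>2}/20{}'.format(*match.groups())
            for match in DATE_IN_TEXT.finditer(text)]


if __name__ == '__main__':
    print(find_dates('shipped 9/9/2008, returned 03-07-09'))
```

Because the regex swallows the optional century itself, the captured year is always two digits long, which is exactly what the split-based approach cannot guarantee.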