Parse Different Date formats: Regex


Question

Reposting this question with specifics (because the last one was flagged down).

I am parsing messy tesseract-ocr output from archive cards and want to recover at least 50% of one field (date1). As the data sample below shows, the rows contain dates in different forms.

Raw_Text
1   "15957-8 . 3n v g - vw, 1 ekresta . bowker, william e tley n0 .qu v- l. c. 
    s. peteris, forestville, n. y. .mafae date1 june 17,1942 by davis, c. j6 
    l. g. b. jonnis, buffalo, n. y. ngsted decl 17, 1949.3y 7 davis, c. j. 
    date3 by j date4 - by date5 by 6 -.5/, 7/19/l date6 17 jul 1916 salamanca. 
    hf date7 31 dec 1986 buffalo, new york "
2   ".1o2o83n5ddn.. -i ekresta i bowles, albert edwin i made date1 june 9p1909 
    by parker, elm. date2 dec . 18 w date3 . by dep osed by date5 by date7mqm 
    9 ivvld wm 4144, mac, .75 076 eaqlwli "
3   "i naime bowles, charles edward made date1 may 31. 1892 by mclaren, wneoi 
    date2 may 18. 1895 by mclaren, w.e. date3 . i by date4 may 10. 1908 by 
    bip. of chicago. date5 by date7 "
4   "101 557 am l i ekrestaibowles, donald manson ..46 ohio trlnlty cathedral, 
    cleveland, ohio made date1 6/19/76 by burt, ji. h. grace , cleveland, ohio 
   date2 11 jun 77 by bp j h burt date3 . 1 .. by date4 by date5 bv m cuyahoga 
   heights, ohio date6 4/29/27 date7 240000 "
5   "227354 101 575 m68, frederick augustus st. paujjs cathedral, buffalo, 
   n.y. made date1 6/15/63 by scaife. l.i... st. thomas. modia, bath, n.y. 
   date2 1/11/611 by scaife. l.eo date3 by date4 by date5 by bradford, n.y. i 
   . 130m 6/1/18 date7 17 jun 1996 foratvme new york z4uc-xl "
6   "1 95812d ll. il ekresta bowles, harry oscar lmade date14 july 17, 190433, 
    lepnard, w.a. date2 july 25 , 1905 by leonard, w.a. i date3 by date4 by 
   date5 by g- m. /(,,/mr date7 jay /z/,. /357i l /mwi yk/maj. "
7   "5025 ,.. 2.57631 il . - . .. .1 i ekresta bowles , jedwiah hibbafd made 
    deac0n 8., i5-0i1862i13y potter, iih. date2 10. 280 1864 1 biy stevens, w. 
    b. date3 by date4 7 .30 l 1875 by date5 by date7 "
8   "30.611126 ekhq il ekresta bowles, ralph hart made date1 12. 210 i1883 by 
    iwiiiliams, i36 date2 7.. 1. 1885 by williams , j. date3 by i date4 by 
    date5 by g .97) l/am 9- date7 10. 4. 1900 (78) if x/ma 3.4, 154.47.11.73. 
    4,... mya-ix "
9   "2.25678 . 1o14593 ekresta bowles, robert brigham, jr. st. matthew s 
    cathedra1,da11quexas made date1 6/18/65 by mason, c. a. 57 mmzws camp 
    dr7///9s tams date2 12 21 cs by 14.45.42 c a date3 i by date4 by date5 , 
    by houston, texas date6 4/11/30 date7 12 dec 2000 dallas texas 2400-xi "
10  "101 619 34hq woe ekresta bowlin1 howard bruce cathedral modia of saint 
    peter 61 st. paul, washin ton, dc made date1 13 jun 92 bybp r h haines 
   (wdc st. alban1s modia, annandale, vir inia . pdumd 16 jan 93 by r h halnes 
    (wdc) date3 by atas by date4 v by date5 by date6 31 aug 1946 e st. louis. 
   il date7 2400-i "
11  "w k8 8km tm boiling jack dnnmwm q- f grace ch , made dat j 11201). salem 
    mares. stverrett. f. ,w a x st. johms modia. memphis, tenh. date1 apr. 25. 
    1955 - bv barth, t.in.. date3 4 by date4 by date5 by date7 wq iw r 1 w .n 
    . 4.1- 1 date6z1l7i1c. "

I parse date1 in a two-step process:
1. Extract the text between the name "date1" and "by".
2. Use a date parser to extract the actual date.

import re
import dateutil.parser as dparser

for lines in Raw_Text:
    lines = lines.lower()            # make lower case
    lines = lines.strip()            # remove leading and trailing spaces
    lines = " ".join(lines.split())  # collapse repeated whitespace

    # Step 1
    # Extract the text between "date1" and "by"
    deacondt = re.findall(r'date1(.*?)by', lines)
    deacondt = ''.join(deacondt)  # convert list to a string

    # Step 2
    # Use dateutil to parse dates in the extracted text
    try:
        deacondt1 = dparser.parse(deacondt)
    except Exception:
        deacondt1 = 'NA'

    print(deacondt1)

The output of Step 1 is:

[' june 17,1942 ']
[' june 9p1909 ']
[' may 31. 1892 ']
[' 6/19/76 ']
[' 6/15/63 ']
['4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ']
[]
[' 12. 210 i1883 ']
[' 6/18/65 ']
[' 13 jun 92 ']
[]

Step 2, however, returns the following output:

2018-06-17 00:00:00
1909-06-17 21:00:00
1892-05-31 00:00:00
1976-06-19 00:00:00
2063-06-15 00:00:00
NA
NA
NA
2065-06-18 00:00:00
1992-06-13 00:00:00
NA

Step 2 fails to return all of the dates. Is there a better date parser for Python 2.7 than "dateutil.parser"?

Recommended answer

You can try this:

deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)

  • fuzzy=True allows strings that contain words which are not part of a date, such as "Today is January 1, 2047 at 8:21:00AM".
  • dayfirst=False tells the parser that ambiguous numeric dates are month-first, as in your input.

      However, these options alone are not enough for dateutil to produce the output you want, so the string passed to the parser first needs to be brought closer to a recognizable date format.

      Regex for extracting the text that follows date1:

      (?s)date1\d?((?:(?!by|date2|date3).)*)
      

      In this pattern, not only 'by' but also 'date2' and 'date3' act as separators, and date10 through date19 are treated as date1.
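To make the behaviour of that pattern concrete, here is a small sketch that applies it to a few fragments taken from the sample rows (line breaks collapsed). The helper name `extract_date1` is an assumption introduced for illustration; only the regex itself comes from the answer.

```python
import re

# Pattern from the answer: after "date1" (and an optional stray digit),
# capture characters one at a time as long as the next characters do not
# start "by", "date2" or "date3".
DATE1_RE = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')

def extract_date1(text):
    """Return the stripped text captured after date1, or None if absent."""
    match = DATE1_RE.search(text)
    return match.group(1).strip() if match else None

# Fragments taken from sample rows 1, 6 and 11, plus row 7, which has no date1.
samples = [
    "mafae date1 june 17,1942 by davis, c. j6",
    "lmade date14 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 by",
    "tenh. date1 apr. 25. 1955 - bv barth, t.in.. date3 4 by date4",
    "made deac0n 8., i5-0i1862i13y potter, iih. date2 10. 280 1864",
]
for text in samples:
    print(repr(extract_date1(text)))
```

For row 6, the question's lazy `date1(.*?)by` ran on to the "by" that follows date2; stopping at `date2` keeps the capture inside the date1 field, although some trailing name text still comes along.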

      The extracted string is then cleaned up (leading and trailing whitespace removed, and so on) so that it becomes acceptable input for the dateutil parser.
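The clean-up can be read as three separate substitution passes. The sketch below spells them out one at a time, using the same patterns as the answer's code; the function name `clean_date_text` is hypothetical and exists only to make the steps easier to follow.

```python
import re

def clean_date_text(raw):
    """Apply the answer's three substitutions one at a time."""
    # Pass 1: drop leading/trailing whitespace and any newlines.
    text = re.sub(r'^\s+|\s+$|\n+', '', raw)
    # Pass 2: turn whitespace runs, commas, and a single stray character
    # wedged between two digits (e.g. the "p" in "9p1909") into a space.
    text = re.sub(r'\s+|,|(?<=\d)[^\d\s/](?=\d)', ' ', text)
    # Pass 3: for a number that follows whitespace, drop one stray leading
    # letter and keep the first four digits when at least four are present,
    # otherwise the first two; any remaining digits are discarded.
    text = re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1', text)
    return text

for raw in [" june 9p1909 ", " 12. 210 i1883 ", " july 17, 190433, lepnard, w.a. "]:
    print(repr(raw), "->", repr(clean_date_text(raw)))
```

Note that pass 3 relies on a lookbehind for whitespace, so a number at the very start of the string (such as the "12." in the second input) is left untouched.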

      regx = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')
      # findall expects a single string, so Raw_Text here is the whole text, not a list of rows.
      raw_date = [re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1', re.sub(r'\s+|,|(?<=\d)[^\d\s\/](?=\d)', ' ', re.sub(r'^\s+|\s+$|\n+', '', m))) for m in regx.findall(Raw_Text)]

      for deacondt in raw_date:
          try:
              deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)
          except Exception:
              deacondt1 = 'NA'
          # Print inside the loop so each extracted string is shown with its parse result.
          print(deacondt + "\n" + str(deacondt1))
      

      Output

      june 17 1942
      1942-06-17 00:00:00
      june 9 1909
      1909-06-09 00:00:00
      may 31. 1892
      1892-05-31 00:00:00
      6/19/76
      1976-06-19 00:00:00
      6/15/63
      2063-06-15 00:00:00
      july 17  1904  lepnard  w.a.
      1904-07-17 00:00:00
      12. 21 1883
      1883-12-21 00:00:00
      6/18/65
      2065-06-18 00:00:00
      13 jun 92
      1992-06-13 00:00:00
      apr. 25. 1955 - bv barth  t.in..
      1955-04-25 00:00:00
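In the output above, 6/15/63 and 6/18/65 come back as 2063 and 2065, because dateutil resolves a two-digit year to a window around the current year, which does not suit historical cards. One possible post-processing step, sketched here with only the standard library, pulls any date later than a chosen cutoff back by 100 years. The function name `pull_back_century` and the cutoff value are assumptions made for illustration and are not part of the answer.

```python
from datetime import datetime

def pull_back_century(parsed, latest_plausible):
    """Shift a date 100 years back if it lies after latest_plausible."""
    if parsed > latest_plausible:
        try:
            parsed = parsed.replace(year=parsed.year - 100)
        except ValueError:
            # Feb 29 may not exist 100 years earlier (e.g. 2000 -> 1900).
            parsed = parsed.replace(year=parsed.year - 100, day=28)
    return parsed

# Assumption: no card in the archive is dated after this point.
cutoff = datetime(2018, 1, 1)

print(pull_back_century(datetime(2063, 6, 15), cutoff))  # 1963-06-15 00:00:00
print(pull_back_century(datetime(1976, 6, 19), cutoff))  # 1976-06-19 00:00:00
```

This only helps when the true date is known to lie before the cutoff; a two-digit year on a nineteenth-century card would still need context from the other fields.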
      
