Python/Pandas Regex for a Wide Variety of Dates


Question

I have a task to extract a wide variety of dates from a text file using Python.

As per the requirements, the following date formats must be properly extracted from the text file:


  • 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  • Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  • 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  • Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  • Feb 2009; Sep 2009; Oct 2010 (shall be parsed to 02/01/2009, 09/01/2009 etc)
  • 6/2008; 12/2009 (shall be parsed to 06/01/2008 etc).
  • 2009; 2010 (shall be parsed to 01/01/2009 and 01/01/2010)

Regex to the rescue!

I came up with the following expression:

(((0?[1-9]|1[0-2])((\/)|(-)))?(((0?[1-9]|[1-2][0-9]|3[0-1])((\/)|(-))))((19[0-9][0-9])|(20[0-1]{1}[0-9])|([0-9][0-9]))|((19[0-9][0-9])|(20[0-1]{1}[0-9])))|((0[1-9])|(1[0-9])|(2[0-9])|(3[0-1]))?(\D)?(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)((\s|\.|-)((19[0-9][0-9])|(20[0-9][0-9])))

I was able to debug it with Regex101 for all use cases.

However, when I try to run it over a Pandas dataframe using the code below, no matches are found for some of the cases. ("df" stands for a Pandas dataframe where each row contains raw text with a date in one of the formats above.)

import re

pattern = '(((0?[1-9]|1[0-2])((\/)|(-)))?(((0?[1-9]|[1-2][0-9]|3[0-1])((\/)|(-))))((19[0-9][0-9])|(20[0-1]{1}[0-9])|([0-9][0-9]))|((19[0-9][0-9])|(20[0-1]{1}[0-9])))|((0[1-9])|(1[0-9])|(2[0-9])|(3[0-1]))?(\D)?(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)((\s|\.|-)((19[0-9][0-9])|(20[0-9][0-9])))'

flags = re.IGNORECASE

m = df.str.extract(pattern, flags)

The cases that fail to match include:

  1. AFeb 1977: Symmes Hospital\n
  2. "NV fire fighter died Sep 2007 while working. Was friend from deployment to San Marino and trainings for years prior. Still troubling to pt. Didn't go to his funeral. Spiritual/Religion:\n
  3. 's Cathy Bowers is a 50 yo single Caucasian female who presents to the ANH Eating Disorders Department for an evaluation and treatment recommendations for low weight. She shared that she has recently lost a great deal of weight and is having difficulty meeting her calorie needs due to difficulties with gagging/swallowing, and aversions to specific food textures. Specifically, since May 2012, she has lost 18 lbs, going from 128 lbs (BMI = 19.5, normal range) to 110.2 lbs (BMI = 16.8, underweight range) at a height of 5\'8" tall. She has had amenorrhea for 2 months. Her current weight is her lowest since high school, when she was a model and weighed 98 lbs (BMI = 14.9, underweight range). At that time, she had amenorrhea, felt pressure to be thin in order to keep her job, and most likely met criteria for frank anorexia nervosa nervosa-restricting type.\n'

For all of these cases, I was able to debug the expression and validate it on Regex101.

This makes me think that there may be a mismatch between the Python parser/version used by Regex101 and the Python version I'm using (3), or perhaps a parameter that I'm not aware of.

Does anyone have an idea?

Thanks!

Recommended answer

Code

See regex in use here

\d+/\d+(?:/\d+)?|(?:\d+ )?(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[.,]?(?:-\d+-\d+| \d+(?:th|rd|st|nd)?,? \d+| \d+)|\d{4}
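As a rough illustration, the pattern above can be tried directly with Python's standard `re` module before involving Pandas. This is only a sketch: the names `DATE_RE` and `sample` are made up for the example, and the sample string is a shortened mix of the formats from the question. Because every group in the pattern is non-capturing, `findall` returns the full matched text for each hit.

```python
import re

# The answer's pattern, split across lines for readability (same regex).
DATE_RE = re.compile(
    r"\d+/\d+(?:/\d+)?"
    r"|(?:\d+ )?(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
    r"[.,]?(?:-\d+-\d+| \d+(?:th|rd|st|nd)?,? \d+| \d+)"
    r"|\d{4}"
)

# Hypothetical sample text built from formats listed in the question.
sample = "04/20/2009; Mar-20-2009; 20 March, 2009; Mar 22nd, 2009; AFeb 1977: Symmes Hospital"

# All groups are non-capturing, so findall yields whole matches.
matches = DATE_RE.findall(sample)
print(matches)
# ['04/20/2009', 'Mar-20-2009', '20 March, 2009', 'Mar 22nd, 2009', 'Feb 1977']
```

Note that "AFeb 1977" is matched as "Feb 1977": the pattern does not require the month name to start at a word boundary.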






Results

Input


04/20/2009; 04/20/09; 4/20/09; 4/3/09 Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009 Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009 Feb 2009; Sep 2009; Oct 2010 (shall be parsed to 02/01/2009, 09/01/2009 etc) 6/2008; 12/2009 (shall be parsed to 06/01/2008 etc). 2009; 2010 (shall be parsed to 01/01/2009 and 01/01/2010) AFeb 1977: Symmes Hospital\n NV fire fighter died Sep 2007 while working. Was friend from deployment to San Marino and trainings for years prior. Still troubling to pt. Didn't go to his funeral. Spiritual/Religion: 's Cathy Bowers is a 50 yo single Caucasian female who presents to the ANH Eating Disorders Department for an evaluation and treatment recommendations for low weight. She shared that she has recently lost a great deal of weight and is having difficulty meeting her calorie needs due to difficulties with gagging/swallowing, and aversions to specific food textures. Specifically, since May 2012, she has lost 18 lbs, going from 128 lbs (BMI = 19.5, normal range) to 110.2 lbs (BMI = 16.8, underweight range) at a height of 5\'8" tall. She has had amenorrhea for 2 months. Her current weight is her lowest since high school, when she was a model and weighed 98 lbs (BMI = 14.9, underweight range). At that time, she had amenorrhea, felt pressure to be thin in order to keep her job, and most likely met criteria for frank anorexia nervosa nervosa-restricting type.



Output

Only the matches are shown below.

04/20/2009
04/20/09
4/20/09
4/3/09
Mar-20-2009
Mar 20, 2009
March 20, 2009
Mar. 20, 2009
Mar 20 2009
20 Mar 2009
20 March 2009
20 Mar. 2009
20 March, 2009
Mar 20th, 2009
Mar 21st, 2009
Mar 22nd, 2009
Feb 2009
Sep 2009
Oct 2010
02/01/2009
09/01/2009
6/2008
12/2009
06/01/2008
2009
2010
01/01/2009
01/01/2010
Feb 1977
Sep 2007
May 2012
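The question also asks that partial dates be parsed to the first day of the month or year (for example, Feb 2009 to 02/01/2009). The regex only extracts text, so that step needs separate code. The following is one possible sketch, assuming each match has already been isolated as its own string; `MONTHS` and `normalize_partial` are hypothetical names and not part of the original answer.

```python
import re

# Map the first three letters of each month name to its number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def normalize_partial(text):
    """Return MM/DD/YYYY for 'Mon YYYY', 'M/YYYY' or 'YYYY' strings, else None."""
    text = text.strip()

    # Month name (short or long) followed by a four-digit year, e.g. "Feb 2009".
    m = re.fullmatch(r"([A-Za-z]{3})[A-Za-z]*\.?,? (\d{4})", text)
    if m and m.group(1).lower() in MONTHS:
        return f"{MONTHS[m.group(1).lower()]:02d}/01/{m.group(2)}"

    # Numeric month and four-digit year, e.g. "6/2008".
    m = re.fullmatch(r"(\d{1,2})/(\d{4})", text)
    if m and 1 <= int(m.group(1)) <= 12:
        return f"{int(m.group(1)):02d}/01/{m.group(2)}"

    # Bare four-digit year, e.g. "2009".
    if re.fullmatch(r"\d{4}", text):
        return f"01/01/{text}"

    # Anything else (full dates, unknown formats) is left to other code.
    return None


print(normalize_partial("Feb 2009"))  # 02/01/2009
print(normalize_partial("6/2008"))    # 06/01/2008
print(normalize_partial("2009"))      # 01/01/2009
```

Full dates such as "Mar 20, 2009" return None here on purpose; they already carry a day and would need their own parsing.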







Explanation

  • Match any of the following options
    • \d+/\d+(?:/\d+)? Match one or more digits, followed by /, followed by one or more digits, optionally followed by another / and one or more digits
    • (?:\d+ )?(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[.,]?(?:-\d+-\d+| \d+(?:th|rd|st|nd)?,? \d+| \d+) Match an optional run of digits followed by a space; then a month name (or its short form); then an optional dot . or comma ,; then one of: - digits - digits; or a space and digits, optionally followed by th, rd, st, or nd and an optional comma, then a space and more digits; or a space followed by one or more digits
    • \d{4} Match any digit exactly 4 times. This is for single years, but it can also match four digits inside other numbers, so you may need to adapt it to your needs. Adding word boundaries, as in \b\d{4}\b, might be a good first step.
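To tie this back to the Pandas call in the question, here is a hedged sketch of applying the pattern with `Series.str.extract`. It assumes the text lives in a Series (the `.str` accessor is a Series accessor) and uses the suggested `\b\d{4}\b` variant for bare years. `str.extract` needs at least one capture group and returns one column per group, so the whole alternation is wrapped in a single group here; the names `MONTH`, `DATE_PATTERN` and `samples` are invented for the example.

```python
import pandas as pd

MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

# One capturing group around the whole alternation, so extract yields
# a single value per row instead of one column per inner group.
DATE_PATTERN = (
    r"(\d+/\d+(?:/\d+)?"
    r"|(?:\d+ )?" + MONTH + r"[.,]?(?:-\d+-\d+| \d+(?:th|rd|st|nd)?,? \d+| \d+)"
    r"|\b\d{4}\b)"
)

# Hypothetical rows, shortened from the cases quoted in the question.
samples = pd.Series([
    "AFeb 1977: Symmes Hospital",
    "NV fire fighter died Sep 2007 while working.",
    "since May 2012, she has lost 18 lbs",
    "seen in 2009",
    "room 12345",
])

# expand=False with a single group returns a Series; rows without a match are NaN.
dates = samples.str.extract(DATE_PATTERN, expand=False)
print(dates.tolist())
# ['Feb 1977', 'Sep 2007', 'May 2012', '2009', nan]
```

The last row shows the effect of the word boundaries: "12345" contains four consecutive digits, but `\b\d{4}\b` refuses to match inside a longer number. `str.extract` returns only the first match per row; `Series.str.extractall` is the method to look at when a row can contain several dates.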