Using Regex to catch dates in many formats

Question

I'm working on an app that scrapes local websites to build a database of upcoming events, and I'm trying to use regex to catch as many date formats as possible.

Consider the following sentence fragments:

  • "The focus of the seminar, on Saturday 2nd February 2013 will be [...]"
  • "Valentines Special @ The Radisson, Feb 14th"
  • "On Friday the 15th of February, a special Hollywood themed [...]"
  • "Symposium on Childhood Play on Friday, February 8th"
  • "Hosting a craft workshop March 9th - 11th in the old [...]"

I want to be able to scan these and catch as many dates as possible. At the moment I'm doing this in what is probably a flawed way (I'm not great at regex), running several regex statements one after the other, like this:

/([0-9]+?)(st|nd|rd|th) (of)? (Jan|Feb|Mar|etc)/i
/([0-9]+?)(st|nd|rd|th) (of)? (January|February|March|Etcetera)/i
/(Jan|Feb|Mar|etc) ([0-9]+?)(st|nd|rd|th)/i
/(January|February|March|Etcetera) ([0-9]+?)(st|nd|rd|th)/i

I could merge these all into one giant regex statement, but it seems like there must be a cleaner way of doing this in PHP, maybe a third-party library or something?
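
For reference, the four patterns can be folded into one expression by making the long month spellings optional and allowing either word order. This is only a minimal sketch: the helper name `find_date_phrases` is hypothetical, and just three months are listed, so the alternation would need completing before real use.

```php
<?php
// Hypothetical helper: returns every "day month" or "month day" phrase found in $text.
function find_date_phrases(string $text): array {
    // Only three months are listed for brevity; extend the alternation as needed.
    $month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?)';
    $day   = '[0-9]{1,2}(?:st|nd|rd|th)?';
    // Branch 1 covers "2nd February" and "15th of February"; branch 2 covers "Feb 14th".
    $pattern = "/\\b(?:$day(?: of)? $month|$month $day)\\b/i";
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(find_date_phrases("On Friday the 15th of February, then Feb 14th and 2nd February 2013"));
```

Building the pattern from named pieces keeps the month list in one place, which is the main maintenance problem with the four separate statements.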

Edit: The regex above may have errors; it's only meant as an example.

Recommended answer

I wrote a function that extracts dates from text by collecting consecutive date-like words and passing them to strtotime():

function parse_date_tokens($tokens) {
  # only try to extract a date if we have 2 or more tokens
  if(!is_array($tokens) || count($tokens) < 2) return false;
  return strtotime(implode(" ", $tokens));
}

function extract_dates($text) {
  static $patterns = Array(
    '/^[0-9]+(st|nd|rd|th)?$/i', # day, with an optional ordinal suffix
    '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month ("etc" is a placeholder for the remaining months)
    '/^20[0-9]{2}$/', # year
    '/^of$/' # filler words
  );
  # defines which of the above patterns aren't actually part of a date
  static $drop_patterns = Array(
    false,
    false,
    false,
    true
  );
  $tokens = Array();
  $result = Array();
  $words = str_word_count($text, 1, '0123456789'); # split into words, treating digits as word characters

  # iterate words and search for matching patterns
  foreach($words as $word) {
    $found = false;
    foreach($patterns as $key => $pattern) {
      if(preg_match($pattern, $word)) {
        if(!$drop_patterns[$key]) {
          $tokens[] = $word;
        }
        $found = true;
        break;
      }
    }

    if(!$found) {
      $result[] = parse_date_tokens($tokens);
      $tokens = Array();
    }
  }
  $result[] = parse_date_tokens($tokens);

  return array_filter($result);
}

# test
$texts = Array(
  "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
  "Valentines Special @ The Radisson, Feb 14th",
  "On Friday the 15th of February, a special Hollywood themed [...]",
  "Symposium on Childhood Play on Friday, February 8th",
  "Hosting a craft workshop March 9th - 11th in the old [...]"
);

$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
  echo "  " . date('d.m.Y H:i:s', $date) . "\n";
}

This outputs:

Dates: 
  02.02.2013 00:00:00
  14.02.2013 00:00:00
  15.02.2013 00:00:00
  08.02.2013 00:00:00
  09.03.2013 00:00:00
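
The fragments without a year presumably came out as 2013 because strtotime() fills in missing parts from the current time. When scraping, it may be preferable to resolve them against a known reference date instead, such as the date the page was fetched. A small sketch using strtotime()'s optional second argument; the reference date and the UTC time zone are assumptions for illustration:

```php
<?php
date_default_timezone_set('UTC'); // assumption: a fixed zone so results are reproducible

// Assumed reference point, e.g. the moment the page was scraped.
$reference = strtotime('2013-01-15 00:00:00');

foreach (array('Feb 14th', '15th February', '2nd February 2013') as $phrase) {
    // Parts missing from the phrase (here, the year) are taken from $reference.
    $timestamp = strtotime($phrase, $reference);
    echo $phrase . ' => ' . date('d.m.Y', $timestamp) . "\n";
}
```

In extract_dates() this would mean passing the reference through to parse_date_tokens() rather than relying on the server clock.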

This solution isn't perfect and certainly has its flaws, but it's a fairly simple way to solve your problem.
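
One such flaw shows in the output above: "March 9th - 11th" yields only the 9th. A possible way to handle day ranges is to match them separately and reuse the month for both ends. The helper name `parse_day_range`, the shortened month list, and the reference date are all assumptions made for this sketch:

```php
<?php
date_default_timezone_set('UTC'); // assumption: a fixed zone for reproducible output

// Hypothetical helper: turns "March 9th - 11th" into start and end timestamps,
// or returns false when no range is found.
function parse_day_range(string $text, int $reference) {
    // Month list shortened for brevity; extend as needed.
    $pattern = '/\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?) ([0-9]{1,2})(?:st|nd|rd|th)?\s*-\s*([0-9]{1,2})(?:st|nd|rd|th)?\b/i';
    if (!preg_match($pattern, $text, $m)) {
        return false;
    }
    // Reuse the captured month for both ends of the range.
    return array(
        strtotime($m[1] . ' ' . $m[2], $reference),
        strtotime($m[1] . ' ' . $m[3], $reference),
    );
}

$range = parse_day_range('Hosting a craft workshop March 9th - 11th in the old [...]', strtotime('2013-01-15'));
echo date('d.m.Y', $range[0]) . ' to ' . date('d.m.Y', $range[1]) . "\n";
```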
