Using Regex to catch dates in many formats
Question
I'm working on an app which scrapes local websites to create a database of upcoming events, and I'm trying to use Regex to catch as many formats of dates as possible.
Consider the following sentence fragments:
- "The focus of the seminar, on Saturday 2nd February 2013 will be [...]"
- "Valentines Special @ The Radisson, Feb 14th"
- "On Friday the 15th of February, a special Hollywood themed [...]"
- "Symposium on Childhood Play on Friday, February 8th"
- "Hosting a craft workshop March 9th - 11th in the old [...]"
I want to be able to scan these and catch as many dates as possible. At the moment I'm doing this in what is probably a flawed way (I'm not great at regex), running several regex statements one after the other, like this:
/([0-9]+?)(st|nd|rd|th) (of)? (Jan|Feb|Mar|etc)/i
/([0-9]+?)(st|nd|rd|th) (of)? (January|February|March|Etcetera)/i
/(Jan|Feb|Mar|etc) ([0-9]+?)(st|nd|rd|th)/i
/(January|February|March|Etcetera) ([0-9]+?)(st|nd|rd|th)/i
I could merge all of these into one giant regex statement, but it seems like there must be a cleaner way of doing this in PHP, maybe a third-party library or something?
Edit: The regex above may have errors; it's only meant as an example.
Recommended answer
I wrote a function which extracts dates out of text by using strtotime():
function parse_date_tokens($tokens) {
    # only try to extract a date if we have 2 or more tokens
    if(!is_array($tokens) || count($tokens) < 2) return false;
    return strtotime(implode(" ", $tokens));
}
function extract_dates($text) {
    static $patterns = Array(
        '/^[0-9]+(st|nd|rd|th)?$/i',                  # day, with optional ordinal suffix
        '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month ("etc" stands in for the remaining months)
        '/^20[0-9]{2}$/',                             # year
        '/^of$/'                                      # filler words
    );
    # defines which of the above patterns aren't actually part of a date
    static $drop_patterns = Array(
        false,
        false,
        false,
        true
    );
    $tokens = Array();
    $result = Array();
    # get all words in text, treating digits as word characters
    $words = str_word_count($text, 1, '0123456789');
    # iterate words and search for matching patterns
    foreach($words as $word) {
        $found = false;
        foreach($patterns as $key => $pattern) {
            if(preg_match($pattern, $word)) {
                if(!$drop_patterns[$key]) {
                    $tokens[] = $word;
                }
                $found = true;
                break;
            }
        }
        # a non-date word ends the current run of tokens
        if(!$found) {
            $result[] = parse_date_tokens($tokens);
            $tokens = Array();
        }
    }
    # parse whatever tokens are left at the end of the text
    $result[] = parse_date_tokens($tokens);
    # drop the entries where no date could be parsed
    return array_filter($result);
}
# test
$texts = Array(
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special @ The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
);
$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
    echo "  " . date('d.m.Y H:i:s', $date) . "\n";
}
This outputs:
Dates:
02.02.2013 00:00:00
14.02.2013 00:00:00
15.02.2013 00:00:00
08.02.2013 00:00:00
09.03.2013 00:00:00
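Most of the years in this output are not in the text: strtotime() fills in the current year for strings such as "Feb 14th". For scraped pages it may be more predictable to resolve such strings against a known reference date, for example the date the page was published. A minimal sketch using the optional second argument of strtotime(); the helper name parse_relative_to and the reference date are assumptions for illustration, not part of the function above:

```php
<?php
// Fixed timezone so the example does not depend on server settings.
date_default_timezone_set('UTC');

// Hypothetical helper: parse $dateText relative to $referenceText
// instead of relative to the current time.
function parse_relative_to($dateText, $referenceText) {
    $base = strtotime($referenceText);
    if ($base === false) {
        return false; // the reference date itself could not be parsed
    }
    // The second argument is the timestamp that missing parts
    // (here: the year) are taken from.
    return strtotime($dateText, $base);
}

// "Feb 14th" has no year, so it is taken from the reference date.
echo date('d.m.Y', parse_relative_to('Feb 14th', '2013-01-20')), "\n"; // prints 14.02.2013
```

The same second argument could be passed through parse_date_tokens() if the reference date is known when the text is scraped.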
This solution may not be perfect and certainly has its flaws, but it's a fairly simple way to approach your problem.
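As for merging the question's separate patterns into one, a single regex can stay readable if it is assembled from named parts. This is only a sketch under assumptions: the variable names, the function find_date_strings and the exact pattern are illustrative, and it only returns the matched substrings, which would still need to be normalized (for example by dropping "of") before being parsed.

```php
<?php
// Illustrative building blocks (names are assumptions, not from the answer).
$months = 'Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
        . 'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?';
$day = '[0-9]{1,2}(?:st|nd|rd|th)?';

// Either "<day> [of] <month>" or "<month> <day>",
// each optionally followed by a four-digit year.
$pattern = "/\\b(?:$day(?:\\s+of)?\\s+(?:$months)|(?:$months)\\s+$day)\\b(?:,?\\s+[0-9]{4})?/i";

// Hypothetical helper: return every substring of $text that looks like a date.
function find_date_strings($text, $pattern) {
    preg_match_all($pattern, $text, $matches);
    return $matches[0]; // index 0 holds the full matches
}

$found = find_date_strings(
    "On Friday the 15th of February, a special [...] Feb 14th",
    $pattern
);
print_r($found); // contains "15th of February" and "Feb 14th"
```

Keeping the month list and the day pattern in their own variables means each piece is written once, instead of being repeated for every word order as in the four separate patterns.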