Parsing Dates from Text in R

Problem description

I repeatedly run into the problem of parsing dates from relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. Some example text:

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to a format such as 2015-07-01 UTC (step 2). Step 2 can be performed with, for example, parse_date_time from the lubridate package, which conveniently accepts multiple candidate date formats:

Case 1:

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

In some cases parse_date_time also works on larger strings that contain the date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

However, as far as I understand, step 2 does not work directly on the full example text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA

Apparently, some of the additional information in the text prevents the date from being parsed directly from the full text. One approach I can think of is to perform step 1 with a regex that extracts a reduced string (similar to Case 1 or Case 2) which contains the date and on which parse_date_time works. However, using regex for dates always seems a bit dirty, because a regex cannot know whether what it extracts is a valid date.

Is there a way to perform step 2 directly on unstructured text such as the example above (Case 3), i.e., without a regex-based workaround?

Any input is much appreciated!

Recommended answer

Using this website, we can construct some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it doesn't work in R... :(

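It does work once corrected. A character class such as [1-31] matches a single character rather than a numeric range, so the day and the ordinal suffix need explicit alternation, and backslashes must be doubled inside an R string: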

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
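
The question's concern that a regex cannot tell whether it has extracted a valid date can be handled by validating the match afterwards. The following is a minimal base R sketch under stated assumptions, not part of the original answer: the helper name extract_first_date is hypothetical, the pattern only covers English dates written as "Month Dth, YYYY", and month names are looked up in the built-in month.name constant so the result does not depend on the session locale. The extracted string could equally be passed to parse_date_time as in Case 1.

```r
# Hypothetical helper (not from the original answer): extract the first
# "Month Dth, YYYY" date from free text and check that it is a real date.
extract_first_date <- function(text) {
  # Spell out full month names so that words such as "Name" cannot match.
  months_alt <- paste(month.name, collapse = "|")
  pattern <- paste0("(", months_alt, ") ([0-9]{1,2})(st|nd|rd|th)?, ([0-9]{4})")

  # regexec() returns the full match followed by one entry per capture group.
  parts <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(parts) == 0) return(as.Date(NA))

  month <- match(parts[2], month.name)  # locale-independent month number
  day   <- as.integer(parts[3])
  year  <- as.integer(parts[5])         # parts[4] is the ordinal suffix

  # ISOdate() gives NA for impossible dates such as February 31st.
  as.Date(ISOdate(year, month, day))
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
extract_first_date(x)
```

Because the pattern requires a day and a comma before the year, "November 2011" in the example text is skipped and only "July 1st, 2015" is picked up. Other date layouts would need additional patterns.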
