Extracting dates following a specific word from a column of strings using dplyr

Question

I am trying to extract the most recent date on which a report was added, from an R data frame of reports. The text always looks like Date Ordered: M/DD/YYYY and may occur zero or more times in a given report. If it repeats, I want the most recent (usually the last) instance, and I'm trying to convert it to a date in a column created with dplyr's mutate().

Using the code below on my actual dataframe, I get the error:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed

However, it runs fine on a single item, which makes me think that it's trying to concatenate the entire column.

The test code below doesn't give an error, but it assigns the last date from the last report to every row:

     lastdate
1 1999-04-15
2 1999-04-15

dataset=data.frame(cbind(ID=c(001,002),
                         Report=c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
                                  "Meh Date Ordered: 4/15/1999")),
                   stringsAsFactors = FALSE)

library(dplyr)
library(gsubfn)  # provides strapplyc()

dataset %>%
  mutate(lastdate = as.Date(last(gsub("Date Ordered:\\s+", "",
                                      strapplyc(Report, 
                                                "Date Ordered:\\s*\\d+/\\d+/\\d+", simplify = TRUE))),
                            "%m/%d/%Y"))

The desired output should be:

2005-4-2
1999-4-15

Actual result on the real dataset:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

Actual result on the test data:

    lastdate
1 1999-04-15
2 1999-04-15

Recommended answer

I suggest using gsub like this:

dataset$lastdate <- as.Date(gsub(".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*", "\\1", dataset$Report), "%m/%d/%Y")
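
Since the question asks for a dplyr column, the same expression can probably be placed inside mutate() as well. gsub is vectorised, so each row's Report is handled on its own, without the whole column being collapsed as happened with last(). The following is a minimal sketch that assumes dplyr is installed. The sample data is rebuilt without cbind() (a change from the question, made so that ID stays numeric), and the variable names pattern and result are only illustrative.

```r
library(dplyr)

# Sample data from the question, built without cbind() so ID is not coerced to character
dataset <- data.frame(
  ID = c(1, 2),
  Report = c(
    "Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
    "Meh Date Ordered: 4/15/1999"
  ),
  stringsAsFactors = FALSE
)

# Pattern from the answer: keep only the last date that follows "Date Ordered:"
pattern <- ".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*"

# gsub works element-wise, so every row gets its own date
result <- dataset %>%
  mutate(lastdate = as.Date(gsub(pattern, "\\1", Report), "%m/%d/%Y"))

print(result[, c("ID", "lastdate")])
```

On the two sample reports this should give 2005-04-02 and 1999-04-15, which matches the desired output in the question.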

See the regex in action.

The regex matches:

  • .* - any 0+ chars as many as possible
  • Date Ordered: - a literal substring
  • \s* - 0+ whitespaces
  • (\d{1,2}/\d{1,2}/\d{4}) - Capturing group 1 (\1): 1 or 2 digits, /, 1 or 2 digits, /, 4 digits
  • .* - the rest of the string
  • | - or
  • .* - the entire string.
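
The breakdown above can be checked with a small base R sketch. It shows two effects of the pattern: the greedy leading .* makes the capture land on the last "Date Ordered:" in a report, and the |.* fallback turns a report without any date into an empty string, which as.Date then converts to NA. The third report is a hypothetical row added for illustration and is not part of the question's data. Because the last occurrence is not guaranteed to be the most recent one, the second half sketches an alternative built on regmatches() and gregexpr() that collects every date and keeps the maximum; this is a swapped-in technique, not the answer's method.

```r
pattern <- ".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*"

reports <- c(
  "Blah Date Ordered: 5/19/2000 positive. Date Ordered: 4/2/2005 negative.",
  "Meh Date Ordered: 4/15/1999",
  "No order date in this report"  # hypothetical row without a match
)

# Greedy .* skips to the last "Date Ordered:"; the |.* branch leaves "" when nothing matches
extracted <- gsub(pattern, "\\1", reports)
last_dates <- as.Date(extracted, "%m/%d/%Y")  # "" becomes NA

# Alternative: collect all dates per report and keep the latest one
date_rx <- "Date Ordered:\\s*\\d{1,2}/\\d{1,2}/\\d{4}"
all_matches <- regmatches(reports, gregexpr(date_rx, reports))
latest_dates <- as.Date(vapply(all_matches, function(m) {
  if (length(m) == 0) return(NA_character_)  # no date in this report
  d <- as.Date(sub("Date Ordered:\\s*", "", m), "%m/%d/%Y")
  format(max(d))                             # ISO string of the latest date
}, character(1)))

print(extracted)
print(last_dates)
print(latest_dates)
```

For these reports both approaches should agree, because the last date in each report is also the latest one. They would differ only for a report whose dates are not in chronological order.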
