Extracting dates following a specific word from a column of strings using dplyr
Question
I am trying to extract the most recent date on which a report was added, from an R data frame of reports. The text always looks like Date Ordered: M/DD/YYYY and may appear zero or more times in a given report. If it repeats, I want the most recent (usually the last) instance, and I'm trying to convert it to a date in a mutated dplyr column.
Using the code below on my actual data frame, I get the error:
Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
  missing value where TRUE/FALSE needed
However, it runs fine on a single item, which makes me think it is trying to concatenate the entire column.
The test code doesn't give an error, but it pulls the last date from the last report for every row:
lastdate
1 1999-04-15
2 1999-04-15
library(dplyr)
library(gsubfn)  # provides strapplyc()

dataset <- data.frame(cbind(ID = c(001, 002),
                            Report = c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
                                       "Meh Date Ordered: 4/15/1999")),
                      stringsAsFactors = FALSE)

dataset %>%
  mutate(lastdate = as.Date(last(gsub("Date Ordered:\\s+", "",
                                      strapplyc(Report,
                                                "Date Ordered:\\s*\\d+/\\d+/\\d+", simplify = TRUE))),
                            "%m/%d/%Y"))
The desired output would be:
2005-4-2
1999-4-15
Actual result on the real dataset:
Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed
Actual result on the test data:
lastdate
1 1999-04-15
2 1999-04-15
Recommended answer
I suggest using gsub like this:
dataset$lastdate <- as.Date(gsub(".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*", "\\1", dataset$Report), "%m/%d/%Y")
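Since the question asks for a mutated dplyr column, the same substitution can also be placed inside mutate(). The following is only a sketch under assumptions: the data frame is rebuilt without cbind(), and the third row (a report with no date) is a hypothetical addition to show the NA case.

```r
library(dplyr)

# Test data from the question, plus a hypothetical third row with no date.
dataset <- data.frame(
  ID = c("1", "2", "3"),
  Report = c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
             "Meh Date Ordered: 4/15/1999",
             "No order recorded"),
  stringsAsFactors = FALSE
)

# Same pattern as in the gsub call above: capture the last date, or match
# the whole string (leaving group 1 empty) when there is no date at all.
pattern <- ".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*"

# gsub() is vectorised, so each row gets its own date; an empty
# replacement result is parsed as NA by as.Date().
result <- dataset %>%
  mutate(lastdate = as.Date(gsub(pattern, "\\1", Report), "%m/%d/%Y"))

print(result[, c("ID", "lastdate")])
```

Because gsub() works element by element, no last() call is needed, so a single value is never recycled across all rows.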
The regex matches:
.* - any 0+ chars, as many as possible
Date Ordered: - a literal substring
\s* - 0+ whitespaces
(\d{1,2}/\d{1,2}/\d{4}) - Capturing group 1 (\1): 1 or 2 digits, /, 1 or 2 digits, /, 4 digits
.* - the rest of the string
| - or
.* - the entire string.
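Note that the greedy leading .* picks the last occurrence in the string, which the question says is usually, but not always, the most recent date. If the dates can appear out of order, one possible alternative is to extract every match per report and take the maximum. This is a base R sketch, not part of the original answer; latest_order_date is a hypothetical helper name, and the first and third sample reports are made up for illustration.

```r
# Hypothetical helper: most recent "Date Ordered" date in one report string,
# or NA when the report contains no such date.
latest_order_date <- function(report) {
  m <- gregexpr("Date Ordered:\\s*\\d{1,2}/\\d{1,2}/\\d{4}", report)
  hits <- regmatches(report, m)[[1]]
  if (length(hits) == 0) return(as.Date(NA))
  # Strip the label, parse each date, and keep the latest one.
  dates <- as.Date(sub("Date Ordered:\\s*", "", hits), "%m/%d/%Y")
  max(dates)
}

# Made-up reports: the first lists its dates out of chronological order,
# the third has no date at all.
reports <- c("Date Ordered: 4/2/2005 follow-up. Date Ordered: 5/19/2000 earlier result.",
             "Meh Date Ordered: 4/15/1999",
             "No order recorded")

# Apply the helper to each report separately and combine into a Date vector.
lastdate <- do.call(c, lapply(reports, latest_order_date))
print(lastdate)
```

The helper is applied to one report at a time. The question's test output is consistent with last() being applied to the result for the whole column, which yields a single value that mutate() then recycles to every row.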