Extracting dates following a specific word from a column of strings using dplyr

Views: 61 | Published: 2021/5/2 20:55:41 | Tags: r, regex, dplyr, gsubfn

Question

I am trying to extract the most recent date that a report was added in an R dataframe of reports. The text always looks like Date Ordered: M/DD/YYYY and may appear zero or more times in a given report. If it repeats, I want the most recent (usually the last) instance, and I'm trying to convert it to a date in a mutated dplyr column.

Using the code below on my actual dataframe, I get the error:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed

However, it runs fine on a single item, which makes me think that it's trying to concatenate the entire column.

The test code doesn't give an error, but actually pulls the last date from the last report for all instances:

     lastdate
1 1999-04-15
2 1999-04-15

library(dplyr)
library(gsubfn)  # provides strapplyc()

dataset=data.frame(cbind(ID=c(001,002),
                         Report=c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
                                  "Meh Date Ordered: 4/15/1999")),
                   stringsAsFactors = F)

dataset %>% 
  mutate(lastdate = as.Date(last(gsub("Date Ordered:\\s+", "",
                                      strapplyc(Report, 
                                                "Date Ordered:\\s*\\d+/\\d+/\\d+", simplify = TRUE))),
                            "%m/%d/%Y"))

The desired output would be:

2005-4-2
1999-4-15

Actual result on the real dataset:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

Actual result on the test data:

    lastdate
1 1999-04-15
2 1999-04-15

Recommended answer

I suggest using gsub like this:

dataset$lastdate <- as.Date(gsub(".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*","\\1", dataset$Report),"%m/%d/%Y")

See the regex in action.
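
To see how this pattern behaves on the question's test data, here is a minimal base-R sketch. The helper name `extract_last_date` and the third, date-free report are assumptions added for illustration. The sketch also uses `sub()` with `perl = TRUE` rather than `gsub()`, a choice made here because a single match already consumes the whole string.

```r
# Hypothetical helper wrapping the pattern from the answer; base R only.
extract_last_date <- function(x) {
  # The greedy leading .* pushes the match to the LAST "Date Ordered:" in each string.
  # The trailing "|.*" alternative matches strings with no date, leaving "" after replacement.
  pattern <- ".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*"
  as.Date(sub(pattern, "\\1", x, perl = TRUE), format = "%m/%d/%Y")
}

reports <- c(
  "Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
  "Meh Date Ordered: 4/15/1999",
  "No order date in this report"  # illustrative extra row, not in the question
)

lastdate <- extract_last_date(reports)
print(lastdate)
# [1] "2005-04-02" "1999-04-15" NA
```

Because the function is vectorised over its input, the same call should also work unchanged inside `mutate()`, for example `mutate(lastdate = extract_last_date(Report))`.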

The regex matches:

.* - any 0+ chars as many as possible
Date Ordered: - a literal substring
\s* - 0+ whitespaces
(\d{1,2}/\d{1,2}/\d{4}) - Capturing group 1 (\1): 1 or 2 digits, /, 1 or 2 digits, /, 4 digits
.* - the rest of the string
| - or
.* - the entire string.
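
The pattern above keeps the last occurrence in each string, which fits the question's "usually the last" case. If reports can list dates out of chronological order, one possible alternative (not part of the original answer) is to collect every match per row and take the maximum. The sketch below uses base R only; the helper name `most_recent_order_date` and the sample reports are assumptions for illustration.

```r
# Hypothetical helper: latest (maximum) "Date Ordered" value per report.
most_recent_order_date <- function(x) {
  pattern <- "Date Ordered:\\s*\\d{1,2}/\\d{1,2}/\\d{4}"
  # regmatches() returns a list holding one character vector of matches per element of x.
  matches <- regmatches(x, gregexpr(pattern, x))
  latest <- vapply(matches, function(m) {
    if (length(m) == 0) return(NA_real_)  # report contains no date
    dates <- as.Date(sub("Date Ordered:\\s*", "", m), format = "%m/%d/%Y")
    as.numeric(max(dates))                # days since 1970-01-01
  }, numeric(1))
  as.Date(latest, origin = "1970-01-01")
}

sample_reports <- c(
  "Date Ordered: 4/2/2005 follow-up. Date Ordered: 5/19/2000 original order.",
  "Meh Date Ordered: 4/15/1999",
  "No order date in this report"
)

latest_dates <- most_recent_order_date(sample_reports)
print(latest_dates)
# [1] "2005-04-02" "1999-04-15" NA
```

In the first sample report the last occurrence is 5/19/2000, so a last-occurrence pattern would return 2000-05-19 there, while this version returns 2005-04-02.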
