R tidyr regex: extract ordered numbers from character column
Question
Suppose I have a data frame like this:
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
which looks like this:
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
Is there a simple way, possibly using the Tidyverse, to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both), I would like to extract 0. Essentially, I would like the final result to look like this:
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using something like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I am pretty lost.
Recommended answer
We can use a regex lookahead in str_extract to extract one or more digits (\\d+) that are followed by a space and 'vis', or by a space and 'data file' or 'data files', into two columns. Rows without a match come back as NA, which is then replaced with 0:
library(dplyr)
library(stringr)
library(tidyr)  # provides replace_na()
df %>%
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)  # turn NA (no match) into 0
#   viz files
# 1  10     0
# 2   1     0
# 3   0     5
# 4   0     1
# 5   0     0
# 6   9    28
# 7   1     1
In the first case, the pattern matches one or more digits (\\d+) followed by a regex lookahead ((?= ... )) that requires a space and then 'vis'. The lookahead only checks what comes next and is not part of the extracted match, so only the digits are returned. For the second column, it extracts the digits that are followed by a space and 'data file' or 'data files' (the trailing s? makes the 's' optional).
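To see the lookahead in isolation, here is a minimal base R sketch of the same idea. It is not the answer's code: extract_count is a hypothetical helper name introduced only for illustration, and it swaps str_extract for regexpr/regmatches with perl = TRUE (which base R needs for lookahead support).

```r
# Hypothetical helper (not part of the original answer): returns the number
# that precedes `label` in each string, or 0 when nothing matches.
extract_count <- function(x, label) {
  pattern <- paste0("\\d+(?= ", label, ")")  # digits followed by " <label>"
  m <- regexpr(pattern, x, perl = TRUE)      # -1 where there is no match
  out <- numeric(length(x))                  # default every row to 0
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

x <- c("This script outputs 10 visualizations.",
       "This script outputs 1 data file.",
       "This script doesn't output any visualizations or data files",
       "This script outputs 9 visualizations and 28 data files.")

extract_count(x, "vis")          # 10 0 0 9
extract_count(x, "data files?")  # 0 1 0 28
```

Because the default is filled in before the matches are assigned, this version needs no separate NA replacement step. The input is assumed to be a character vector; a factor column would need as.character() first.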