R tidyr regex: extract ordered numbers from character column

Problem description

Suppose I have a data frame like this:

df <- data.frame(x=c("This script outputs 10 visualizations.", 
                     "This script outputs 1 visualization.", 
                     "This script outputs 5 data files.", 
                     "This script outputs 1 data file.", 
                     "This script doesn't output any visualizations or data files", 
                     "This script outputs 9 visualizations and 28 data files.", 
                     "This script outputs 1 visualization and 1 data file."))

which looks like this:

                                                            x
1                      This script outputs 10 visualizations.
2                        This script outputs 1 visualization.
3                           This script outputs 5 data files.
4                            This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6     This script outputs 9 visualizations and 28 data files.
7        This script outputs 1 visualization and 1 data file.

Is there a simple way, possibly using the Tidyverse, to extract the number of visualizations and the number of data files for each row? When a row mentions no visualizations (or no data files, or neither), I would like to extract 0. Essentially, I would like the final result to look like this:

    viz   files
1    10       0
2     1       0
3     0       5
4     0       1
5     0       0
6     9      28
7     1       1

I tried using something like

str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")

but I'm quite lost.

Recommended answer

We can use a regex lookahead in str_extract to pull out one or more digits (\\d+) that are followed by a space and 'vis', or by a space and 'data file' / 'data files', and store the results in two columns:

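library(dplyr)
library(stringr)
library(tidyr)   # provides replace_na()

df %>% 
  # str_extract() returns NA for rows where the pattern does not match
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  # turn those NAs into 0
  mutate_all(replace_na, 0)
#  viz files
#1  10     0
#2   1     0
#3   0     5
#4   0     1
#5   0     0
#6   9    28
#7   1     1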

In the first case, the pattern matches one or more digits (\\d+) followed by a lookahead ((?=...)) that requires a space and then 'vis'. The lookahead only checks what comes next; it is not part of the extracted text, so only the digits are returned. In the second column, the pattern extracts the digits that are followed by a space and 'data file' or 'data files' (the trailing s? makes the 's' optional). Rows with no match yield NA, which replace_na then converts to 0.
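
To see the lookahead behaviour in isolation, here is a minimal sketch on a plain character vector. The helper count_before is hypothetical (it is not part of stringr or of the answer above), and it assumes every count in the text is written as digits immediately before the label.

```r
library(stringr)

x <- c("This script outputs 10 visualizations.",
       "This script outputs 1 data file.",
       "This script doesn't output any visualizations or data files",
       "This script outputs 9 visualizations and 28 data files.")

# Hypothetical helper: return the integer that directly precedes `label`
# (a regex fragment), or 0 when the text contains no such number.
count_before <- function(text, label) {
  pattern <- paste0("\\d+(?= ", label, ")")  # digits, then lookahead for " <label>"
  found <- str_extract(text, pattern)        # NA where nothing matches
  found[is.na(found)] <- "0"                 # treat "no match" as zero
  as.integer(found)
}

viz   <- count_before(x, "visualizations?")
files <- count_before(x, "data files?")

print(data.frame(viz = viz, files = files))
```

Because the label is passed as a regex fragment, the same helper could be reused for other units, as long as the number sits right before the label in the sentence.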
