R dplyr: Filter data by multiple regex expressions defined by a vector
Question
I have a data frame from which I want to select the important columns and then filter the rows to keep only those whose value has a specific ending.
A regex makes it simple to define one ending value using the xx$ pattern. But how can I vary over multiple possible endings (xx$, yy$)?
Dummy example:
library(dplyr)

x <- c("aa", "aa", "aa", "bb", "cc", "cc", "cc")
y <- c(101, 102, 113, 201, 202, 344, 407)
type <- rep("zz", 7)
df <- data.frame(x, y, type)

# Select all rows where y ends with "7"
df %>%
  select(x, y) %>%
  filter(grepl("7$", y))

# This works when I hard-code the endings, but I need to supply them as a vector instead
df %>%
  select(x, y) %>%
  filter(grepl("[27]$", y))  # need to modify this to use multiple endings

# How do I modify this expression to use a vector of endings (ids) instead?
ids <- c(7, 2)  # define vector of my values
df %>%
  select(x, y) %>%
  filter(grepl("ids$", y))  # does not work: how should "grepl(ids, y)" be written?
Expected output:
   x   y type
1 aa 102   zz
2 cc 202   zz
3 cc 407   zz
Example based on this question: Regular expression (RegEx) and dplyr::filter()
Recommended answer
You can use
df %>%
  select(x, y) %>%
  filter(grepl(paste0("(?:", paste(ids, collapse = "|"), ")$"), y))
The paste0("(?:", paste(ids, collapse="|"), ")$") part builds an alternation pattern that matches only at the end of the string, because of the $ anchor at the end.
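To see what the pattern looks like before it reaches grepl(), it can be built and inspected on its own. This is a minimal sketch in base R that reuses ids and y from the question; the variable names pattern and matches are introduced here only for illustration.

```r
ids <- c(7, 2)
y <- c(101, 102, 113, 201, 202, 344, 407)

# Join the endings with "|" and wrap them in a group anchored at the end of the string
pattern <- paste0("(?:", paste(ids, collapse = "|"), ")$")
print(pattern)  # "(?:7|2)$"

# grepl() coerces the numeric y to character before matching
matches <- y[grepl(pattern, y)]
print(matches)  # 102 202 407
```

The same pattern string can be passed to grepl() inside filter(), as in the answer above.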
NOTE: If the values can contain special regex metacharacters, you need to escape the values in the character vector first:
regex.escape <- function(string) {
  # Prefix every regex metacharacter with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
df %>%
  select(x, y) %>%
  filter(grepl(paste0("(?:", paste(regex.escape(ids), collapse = "|"), ")$"), y))  # note the regex.escape(ids) call
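Why the escaping matters can be shown with an ending that contains a dot. This is a minimal sketch that assumes hypothetical string values and the ending ".5" (neither appears in the original question); regex.escape is repeated so that the block runs on its own.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

vals <- c("1.5", "125", "3.7")  # hypothetical values, for illustration only
ends <- c(".5")                 # hypothetical ending that contains a metacharacter

raw_pattern  <- paste0("(?:", paste(ends, collapse = "|"), ")$")
safe_pattern <- paste0("(?:", paste(regex.escape(ends), collapse = "|"), ")$")

# Unescaped, "." matches any character, so "125" is matched as well
raw_hits <- grepl(raw_pattern, vals)
print(raw_hits)   # TRUE TRUE FALSE

# Escaped, only a literal ".5" ending matches
safe_hits <- grepl(safe_pattern, vals)
print(safe_hits)  # TRUE FALSE FALSE
```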
For example, paste0("(?:", paste(c("7", "8", "ids"), collapse="|"), ")$") will output (?:7|8|ids)$:
- (?: - start of a non-capturing group that acts as a container for the alternatives, so that the $ anchor applies to all of them and not just to the last one, matching any of
  - 7 - a 7 char
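The effect of the group on the $ anchor can be checked directly. This is a minimal sketch that uses the endings 7 and 2 from the question plus a made-up test value "170", which is an assumption and not part of the original data.

```r
# "170" is a hypothetical value added for illustration: it contains a 7 but ends in 0
vals <- c("170", "102", "407")

# Without a group, $ binds only to the last alternative: "7 anywhere" OR "2 at the end"
ungrouped <- grepl("7|2$", vals)
print(ungrouped)  # TRUE TRUE TRUE

# With the non-capturing group, $ applies to both alternatives
grouped <- grepl("(?:7|2)$", vals)
print(grouped)    # FALSE TRUE TRUE
```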