Regex in R: replace only part of a pattern
s <- "YXABCDXABCDYX"
I want to use a regular expression to return ABCDABCD, i.e. 4 characters on each side of the central "X" but not including the "X". Note that "X" is always in the center with 6 letters on each side.

I can find the central pattern with e.g. "[A-Z]{4}X[A-Z]{4}", but can I somehow let the return be the first and third group in "([A-Z]{4})(X)([A-Z]{4})"?
Your regex "([A-Z]{4})(X)([A-Z]{4})" matches only the middle of your string: there are characters before the first capture group ([A-Z]{4}) and after the last one, and gsub leaves everything outside the match untouched. So you can add .* on both sides, which matches any character (.) 0 or more times (*), so that the pattern covers the whole string.

You can reference the groups in the gsub replacement using \\n, where n is the nth capture group:
s <- "YXABCDXABCDYX"
gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
This basically matches the entire string and replaces it with whatever was captured in groups 1 and 3, pasted together.
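
To see what the surrounding .* contributes, it may help to compare the two patterns side by side. This is a small sketch on the same input; the variable names partial and full are only illustrative.

```r
s <- "YXABCDXABCDYX"

# Without .*, only the 9-character middle section is matched and replaced;
# the leading "YX" and trailing "YX" are outside the match and stay in place.
partial <- gsub('([A-Z]{4})(X)([A-Z]{4})', '\\1\\3', s)
print(partial)
# [1] "YXABCDABCDYX"

# With .* on both sides, the match covers the whole string,
# so only the two captured groups survive the replacement.
full <- gsub('.*([A-Z]{4})(X)([A-Z]{4}).*', '\\1\\3', s)
print(full)
# [1] "ABCDABCD"
```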
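Another way would be to use (?i), which turns on case-insensitive matching, along with [a-z] or \\w. Note that \\w is more permissive than a letter class: it also matches digits and underscores.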
gsub('(?i).*(\\w{4})(x)(\\w{4}).*', '\\1\\3', s)
# [1] "ABCDABCD"
Or, if you like dots:
gsub('.*(.{4})X(.{4}).*', '\\1\\2', s)
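
If you would rather pull out the captured groups than rewrite the whole string, base R's regexec and regmatches can return them directly. The following is a minimal sketch of that alternative, not part of the answer above; the names m and result are illustrative.

```r
s <- "YXABCDXABCDYX"

# regexec() reports the positions of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(s, regexec('([A-Z]{4})X([A-Z]{4})', s))[[1]]

# m[1] is the full match, m[2] and m[3] are the two capture groups.
result <- paste0(m[2], m[3])
print(result)
# [1] "ABCDABCD"
```

This avoids the surrounding .* entirely, since nothing is being replaced; the trade-off is an extra step to paste the groups together.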