R: fastest way to extract all substrings contained between two substrings
Question
I am looking for an efficient way to extract all matches between two substrings in a character string. For example, say I want to extract all substrings contained between the strings
start="strt"
and
stop="stp"
in the string
x="strt111stpblablastrt222stp"
I would like to get the vector
"111" "222"
What is the most efficient way to do this in R? Using a regular expression, perhaps? Or are there better ways?
Recommended answer
For something as simple as this, base R works just fine.
You can switch on PCRE with perl=TRUE and use lookaround assertions.
x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"
Explanation:
(?<= # look behind to see if there is:
strt # 'strt'
) # end of look-behind
.*? # any character except \n (0 or more times, matching the least amount possible)
(?= # look ahead to see if there is:
stp # 'stp'
) # end of look-ahead
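The pattern above hard-codes the two delimiters. As a minimal sketch, the same lookaround approach can be wrapped in a small helper that takes the delimiters as arguments. The name extract_between is hypothetical and not part of the original answer. The sketch assumes the delimiters are meant as literal text, so it quotes them with PCRE's \Q...\E, and it assumes neither delimiter contains the sequence "\E".

```r
# Hypothetical helper (not from the original answer): extract every substring
# found between two literal delimiters, for each element of a character vector.
extract_between <- function(x, start, stop) {
  # \Q...\E makes PCRE treat the delimiters as literal text, so characters
  # such as "." or "[" need no manual escaping.
  # Assumption: neither delimiter contains the sequence "\E".
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  # regmatches() returns a list with one character vector per element of x.
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

res <- extract_between(c("strt111stpblablastrt222stp", "no match here"),
                       "strt", "stp")
res[[1]]
# [1] "111" "222"
res[[2]]
# character(0)
```

Because regmatches returns a list, an input with no match yields an empty character vector instead of an error, which makes the helper convenient to apply to a whole column of strings.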
The answers below have been updated to the current syntax.
You may also consider using the stringi package.
library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"
And rm_between from the qdapRegex package.
library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"