strsplit inconsistent with gregexpr
Question
A comment on my answer to this question suggests a regex that should give the desired result with strsplit, but it does not, even though the regex seems to correctly match the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
#[1] 4 13
# And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","
Huh?! What is going on?
Recommended answer
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
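To make the loop concrete, here is a rough sketch of it in R. The helper name split_like_strsplit is hypothetical, and this is only a simplified model of the documented algorithm, not how strsplit is actually implemented. It assumes a single input string and does not handle zero-length matches (it stops instead).

```r
# Hypothetical helper: a simplified model of the loop described in ?strsplit.
# Assumes one input string and a pattern that never matches the empty string.
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  while (nzchar(s)) {
    # Search only the part of the string that is still left.
    m <- regexpr(pattern, s, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1L) {
      # No match: the rest of the string is the last piece.
      out <- c(out, s)
      break
    }
    if (len == 0L) stop("zero-length matches are not handled in this sketch")
    # Add the text to the left of the match to the output ...
    out <- c(out, substr(s, 1L, start - 1L))
    # ... then remove the match and everything to the left of it.
    s <- substr(s, start + len, nchar(s))
  }
  out
}

x <- "123,34,56,78,90"
pat <- "^\\w+\\K,|,(?=\\w+$)"
res <- split_like_strsplit(x, pat)
print(res)
```

Because the pattern is re-applied to the shortened string on every pass, the ^ alternative matches again each time, so the sketch splits on every comma just as strsplit does in the question.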
In other words, at each iteration ^ matches the beginning of the remaining string, because the pieces already split off have been removed.
A simple illustration of this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
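If the goal is still to split on only the first and last comma, one possible workaround (not part of the original answer, just a sketch) is to reuse the gregexpr match data from the question, since gregexpr evaluates the pattern against the whole string. regmatches with invert = TRUE returns the pieces between the matches instead of the matches themselves.

```r
x <- "123,34,56,78,90"

# Same regex as in the question, matched against the whole string at once.
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)

# invert = TRUE returns the substrings between the matches.
pieces <- regmatches(x, m, invert = TRUE)[[1]]
print(pieces)
#[1] "123"      "34,56,78" "90"
```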