R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
Question
I'm trying to remove non-alphabet characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this belong to another group of characters?
library(stringi)
string1 <- c(
  "this is a test",
  "this, is also a test",
  "this is the final. test",
  "this is the final + test!"
)
# Replace punctuation with a space; this leaves the '+' untouched
string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
# Workaround: replace the '+' explicitly
string1 <- stri_replace_all_regex(string1, '\\+', ' ')
Recommended answer
POSIX character classes need to be wrapped inside a character class; the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.
Within the ASCII range, this POSIX named class matches every character that is not a control, alphanumeric, or space character.
ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
However, if a locale is in effect, it could alter the behavior of [[:punct:]] ...
The R documentation ?regex states the following: Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation is that of the POSIX locale.
The Open Group's LC_CTYPE definition for punct says:
Define characters to be classified as punctuation characters.
In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.
In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.
However, the stringi package seems to depend on ICU, and locale is a fundamental concept in ICU.
Using the stringi package, I recommend using the Unicode properties \p{P} and \p{S}.
\p{P} matches any kind of punctuation character, but within ASCII it is missing nine of the characters that the POSIX class punct includes. This is because Unicode splits what POSIX considers to be punctuation into two categories, Punctuation and Symbols. This is where \p{S} comes into play ...
# string1 here is the original, unmodified vector from the question
stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
# [1] "this is a test"            "this  is also a test"
# [3] "this is the final  test"   "this is the final   test "
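To make the Punctuation/Symbol split concrete, here is a minimal sketch that classifies the 32 ASCII characters listed earlier by Unicode general category. The variable names (ascii_punct, is_symbol, is_punct) are illustrative only, and the sketch assumes stringi is installed.

```r
library(stringi)

# The 32 ASCII characters matched by POSIX [[:punct:]] (copied from the listing above)
ascii_punct <- strsplit("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~", "")[[1]]

# Test each character against the Unicode Symbol and Punctuation categories
is_symbol <- stri_detect_regex(ascii_punct, "\\p{S}")
is_punct  <- stri_detect_regex(ascii_punct, "\\p{P}")

# Characters that \p{P} alone would miss
symbols <- paste(ascii_punct[is_symbol], collapse = "")
print(symbols)
# [1] "$+<=>^`|~"
```

The nine characters printed are the ones Unicode files under Symbol rather than Punctuation, which is why a pattern limited to \p{P} leaves the + in place.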
Or fall back to gsub from base R, which handles this very well.
# string1 here is the original, unmodified vector from the question
gsub('[[:punct:]]', ' ', string1)
# [1] "this is a test"            "this  is also a test"
# [3] "this is the final  test"   "this is the final   test "
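As a quick sanity check, the two approaches can be compared side by side on the question's input. This is only a sketch: the names x, via_icu, and via_base are made up for illustration, and the agreement shown here is only expected for ASCII input in a typical C or UTF-8 locale, since base R's [[:punct:]] is locale-dependent.

```r
library(stringi)

# The original, unmodified vector from the question
x <- c(
  "this is a test",
  "this, is also a test",
  "this is the final. test",
  "this is the final + test!"
)

# ICU route: Unicode Punctuation plus Symbol categories
via_icu <- stri_replace_all_regex(x, "[\\p{P}\\p{S}]", " ")

# Base R route: POSIX punct class inside a bracket expression
via_base <- gsub("[[:punct:]]", " ", x)

# For this ASCII-only input both routes should give the same result
print(identical(via_icu, via_base))
```

Each matched character is replaced by one space, so runs of spaces remain where punctuation sat next to an existing space.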