R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

Question

I'm trying to remove non-alphabet characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this belong to another group of characters?

library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)

string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
string1 <- stri_replace_all_regex(string1, '\\+', ' ')

Recommended answer

POSIX character classes need to be wrapped inside a character class; the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.
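
As a minimal sketch of why the wrapping matters, the base R snippet below compares the bare and the wrapped form on a made-up sample string (x is a hypothetical input, not from the question). It assumes base R's default regex engine, where the bare form is read as a bracket expression listing the literal characters ':', 'p', 'u', 'n', 'c' and 't'; it does not describe how ICU treats the bare form.

```r
# Hypothetical sample input, chosen so that both effects are visible.
x <- "punct: a+b, c."

# Bare form: base R reads this as a bracket expression that lists the
# literal characters ':', 'p', 'u', 'n', 'c', 't'.
bare <- gsub("[:punct:]", "", x)

# Wrapped form: the POSIX class sits inside a bracket expression.
wrapped <- gsub("[[:punct:]]", "", x)

print(bare)     # " a+b, ."
print(wrapped)  # "punct ab c"
```

The bare pattern strips letters and colons but leaves '+', ',' and '.' in place, which is the opposite of what was intended.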

In the ASCII range, this POSIX named class matches every character that is not a control, alphanumeric, or space character.

ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

However, if a locale is in effect, it could alter the behavior of [[:punct:]] ...

R Documentation ?regex states the following: Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation is that of the POSIX locale.

The Open Group's LC_CTYPE definition for punct says:

Define characters to be classified as punctuation characters.

In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.


However, the stringi package seems to depend on ICU and locale is a fundamental concept in ICU.

Using the stringi package, I recommend using the Unicode Properties \p{P} and \p{S}.

  • \p{P} matches any kind of punctuation character. However, within ASCII it is missing nine of the characters that the POSIX class punct includes ($ + < = > ^ ` | ~). This is because Unicode splits what POSIX considers to be punctuation into two categories, Punctuation and Symbols. This is where \p{S} comes into play ...

stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "

  • Or fall back to gsub from base R, which handles this very well.

    gsub('[[:punct:]]', ' ', string1)
    # [1] "this is a test"            "this  is also a test"     
    # [3] "this is the final  test"   "this is the final   test "
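
To make the Punctuation/Symbols split concrete, here is a rough sketch that lists the ASCII characters which POSIX punct covers but \p{P} does not. It assumes the stringi package is installed and that base R's default engine classifies ASCII punctuation as shown earlier in this answer; the variable names are illustrative only.

```r
library(stringi)

# ASCII characters 1..127, one element per character (NUL is skipped).
ascii <- rawToChar(as.raw(1:127), multiple = TRUE)

# POSIX punct as interpreted by base R's default regex engine.
posix_punct <- ascii[grepl("[[:punct:]]", ascii)]

# Unicode general categories as interpreted by ICU through stringi.
icu_p <- ascii[stri_detect_regex(ascii, "\\p{P}")]
icu_s <- ascii[stri_detect_regex(ascii, "\\p{S}")]

# Characters that POSIX treats as punctuation but \p{P} does not match.
missing_from_p <- setdiff(posix_punct, icu_p)
print(missing_from_p)

# TRUE if \p{P} and \p{S} together cover the same ASCII set as POSIX punct.
print(setequal(posix_punct, union(icu_p, icu_s)))
```

The '+' from the question should show up in missing_from_p, which is why the pattern needs both \p{P} and \p{S}.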
    
