Regex gsub in R: differentiate between ellipsis and periods


Problem description

text="stack overflow... is a popular website."

I want to separate punctuation marks from words. The output should be:

"stack overflow ... is a popular website . "

Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE) returns:

"stack overflow . . . is a popular website . "

because it does not differentiate between periods and an ellipsis (suspension points). In short, when three periods appear together in the text, R should treat them as a single punctuation mark.

Recommended answer

I think a non-lookaround approach is more efficient and more readable:

text <- "stack overflow... is a popular website."
# Capture each run of periods and re-emit it with one space on either side
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "

See the IDEONE demo.

I updated the post since a space is required both before and after the punctuation.

The [[:space:]]* on either side of (\\.+) matches zero or more whitespace characters, and (\\.+) matches one or more periods. The parentheses form a capturing group whose value is stored in numbered buffer 1, which the replacement pattern accesses with the \1 backreference (written "\\1" in an R string literal). So \1 is replaced with the periods captured by the pattern. Capturing is more efficient than using lookarounds since there is no overhead of checking the text before/after the current position.
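To make the capture-and-replace step concrete, here is a small sketch that first lists what the pattern matches and then applies the replacement. The variable names (pattern, matches, spaced) are illustrative only and are not part of the original answer.

```r
text <- "stack overflow... is a popular website."
pattern <- "[[:space:]]*(\\.+)[[:space:]]*"

# Each full match includes any whitespace the pattern consumed around the periods.
matches <- regmatches(text, gregexpr(pattern, text))[[1]]
print(matches)

# Only group 1 (the periods) is kept; the replacement supplies the spaces.
spaced <- gsub(pattern, " \\1 ", text)
print(spaced)

# Existing whitespace is consumed by the match, so a second pass
# leaves the already-spaced string unchanged.
print(identical(gsub(pattern, " \\1 ", spaced), spaced))
```

Because the surrounding whitespace is part of the match rather than left in place, text that already has spaces around its punctuation does not accumulate extra ones.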

Now, if you need to handle all punctuation, use [[:punct:]]:

gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)

See the R regex help:

[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

Code demo:

text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "
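If the goal of the spacing is tokenization, one possible follow-up (an assumption about the use case, not something the original answer states) is to trim the trailing space and split on whitespace. The names spaced and tokens are illustrative, and trimws() requires R 3.2.0 or later.

```r
text <- "Hi!stack overflow... is a popular website, I visit it every day."

# Put a single space around every run of punctuation.
spaced <- gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)

# Drop the trailing space, then split on runs of whitespace so that
# words and punctuation runs become separate tokens.
tokens <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]
print(tokens)
```

The ellipsis stays together as a single "..." token because [[:punct:]]+ captures the whole run at once.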

UPDATE FOR HYPHENATED WORDS

To avoid matching hyphenated words, you can match and skip any - that is surrounded by word boundaries:

text <- "Hi!stack-overflow... is a popular website, I visit it every day."
# \b-\b(*SKIP)(*F) matches a hyphen between word characters and then discards that match
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl = TRUE)
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "

See the demo.
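As a rough illustration of what the (*SKIP)(*F) branch changes, the sketch below runs the same replacement with and without it. It also shows a side effect of switching from [[:punct:]] to \p{P}: symbols such as + belong to the Unicode symbol category rather than punctuation, so \p{P} leaves them alone. The variable names and the "a+b, c" input are made up for this example.

```r
text <- "Hi!stack-overflow... is a popular website, I visit it every day."

# With the skip branch: a hyphen between two word characters is left untouched.
with_skip <- gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl = TRUE)
print(with_skip)

# Without it: the hyphen is treated like any other punctuation mark.
without_skip <- gsub("\\s*(\\p{P}+)\\s*", " \\1 ", text, perl = TRUE)
print(without_skip)

# \p{P} does not cover symbols such as "+", while [[:punct:]] does.
unicode_punct <- gsub("\\s*(\\p{P}+)\\s*", " \\1 ", "a+b, c", perl = TRUE)
posix_punct <- gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", "a+b, c")
print(unicode_punct)
print(posix_punct)
```

If symbols like $ or + should be separated as well, keeping [[:punct:]] in the second alternative is one option.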
