Regular Expression For Consecutive Duplicate Bigrams

Question

My question is a direct extension of this earlier question about detecting consecutive words (unigrams) in a string.

In the previous question, a consecutively repeated word could be detected via this regex: \b(\w+)\s+\1\b
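As a quick illustration of the unigram regex, here is a minimal sketch in R. The input string is a made-up example, not one taken from the earlier question, and perl = TRUE is used because the pattern relies on word boundaries.

```r
# Hypothetical input containing one consecutively repeated word ("the the")
s <- "the cat sat on the the mat"

# \\1 in the pattern must match the same word again;
# \\1 in the replacement keeps a single copy of it
result <- gsub("\\b(\\w+)\\s+\\1\\b", "\\1", s, perl = TRUE)
print(result)
## [1] "the cat sat on the mat"
```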

Here, I want to detect consecutive bigrams (pairs of words):

are blue and then and then very bright

Ideally, I also want to know how to replace the detected pattern (duplicate) by a single element, so as to obtain in the end:

are blue and then very bright

(for this application, if it matters, I am using gsub in R)

Recommended answer

The point here is that in some cases, there will be repeating substrings that include shorter repeated substrings. So, to match the longer ones, you would use

(\b.+\b)\1\b

(see the regex demo), and to find the shorter substrings, I'd rely on lazy dot matching:

(\b.+?\b)\1\b

See this regex demo. The replacement string will be \1, the backreference to the text captured by the first grouping construct (...).

You need a PCRE regex to make it work, since there are documented issues with matching multiple word boundaries with gsub (so, add the perl=TRUE argument).

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Note that in case your repeated substrings can span across multiple lines, you can use the PCRE regex with the DOTALL modifier (?s) at the start of the pattern (so that a . could also match a newline symbol).
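To make the effect of (?s) concrete, here is a small sketch. The input string is hypothetical and chosen so that the repeated chunk contains a newline; the variable names are illustrative only.

```r
# Hypothetical input: the repeated chunk "foo\nbar " spans a line break
s <- "foo\nbar foo\nbar baz"

# Without (?s), "." does not match the newline, so the repeat is not found
without_dotall <- gsub("(\\b.+?\\b)\\1\\b", "\\1", s, perl = TRUE)

# With (?s), "." also matches "\n" and the duplicate chunk is collapsed
with_dotall <- gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl = TRUE)

print(identical(without_dotall, s))
## [1] TRUE
print(with_dotall)
## [1] "foo\nbar baz"
```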

So, the R code will look like

gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl=TRUE)   ## longer repeated substrings

gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl=TRUE)  ## shorter repeated substrings

See the IDEONE demo:

text <- "are blue and then and then more and then and then more very bright"
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", text, perl=TRUE) ## shorter repeated substrings
## [1] "are blue and then more and then more very bright"
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", text, perl=TRUE) ## longer repeated substrings
## [1] "are blue and then and then more very bright"
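
The patterns above match any repeated substring, not only pairs of words. If the goal is strictly a repeated bigram, one possible alternative (not part of the answer above) is to extend the unigram regex so that the group captures exactly two words. The sketch below assumes that a word is a run of \w characters and that the whitespace between the two words is identical in both copies; the input string is illustrative.

```r
# Assumption: a "word" is a run of \w characters separated by whitespace
bigram_pattern <- "\\b(\\w+\\s+\\w+)\\s+\\1\\b"

# Illustrative input with the bigram "and then" repeated once
s <- "are blue and then and then very bright"

# Keep a single copy of the repeated pair of words
deduped <- gsub(bigram_pattern, "\\1", s, perl = TRUE)
print(deduped)
## [1] "are blue and then very bright"
```

Because the backreference must match the captured text exactly, a pair such as "and  then and then" with differing internal spacing would not be treated as a duplicate by this sketch.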
