Remove repeated elements in a string with R


Question

I plan to remove repeated elements (each containing two or more characters) from strings. For example, from "aaa" I expect "aaa", from "aaaa" I expect "aa", from "abababcdcd" I expect "abcd", and from "cdababcdcd" I expect "cdabcd".

I tried gsub("(.{2,})\\1+","\\1",str). It works in cases 1-3, but fails in case 4. How can I solve this problem?

Recommended answer

Solution

The solution is to rely on the PCRE or ICU regex engines, rather than TRE.

Use either base R gsub with perl=TRUE (which uses the PCRE regex engine) and the "(?s)(.{2,})\\1+" pattern, or stringr::str_replace_all() (which uses the ICU regex engine) with the same pattern:

> x <- "cdababcdcd"
> gsub("(?s)(.{2,})\\1+", "\\1", x, perl=TRUE)
[1] "cdabcd"
> library(stringr)
> str_replace_all(x, "(?s)(.{2,})\\1+", "\\1")
[1] "cdabcd"

The (?s) flag is necessary for . to match any character, including line break characters (in TRE regex, . matches all characters by default).
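
The effect of the flag shows up on a string whose repeated unit contains a line break. This is a minimal sketch. The input `y` is made up for illustration and does not come from the question:

```r
# Hypothetical input: the repeated unit "a\n" contains a line break
y <- "a\na\n"

# Without (?s), the PCRE dot does not match "\n", so no repeat is found
no_flag <- gsub("(.{2,})\\1+", "\\1", y, perl = TRUE)

# With (?s), the dot also matches "\n" and the repeated unit is collapsed
with_flag <- gsub("(?s)(.{2,})\\1+", "\\1", y, perl = TRUE)

print(no_flag)    # "a\na\n" (unchanged)
print(with_flag)  # "a\n"
```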

Details

TRE regex is not good at handling "pathological" cases, which are mostly related to backtracking and directly involve quantifiers:

The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M^2 N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice.

Predictable matching speed
Because of the matching algorithm used in TRE, the maximum time consumed by any regexec() call is always directly proportional to the length of the searched string. There is one exception: if back references are used, the matching may take time that grows exponentially with the length of the string. This is because matching back references is an NP complete problem, and almost certainly requires exponential time to match in the worst case.

In those cases, when TRE has trouble calculating all the possibilities of matching a string, it does not return any match and the string is returned as is. Hence, the gsub call makes no changes.
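
To check the PCRE-based call against all four inputs from the question at once, something like the following sketch could be used. The helper name `dedupe_repeats` is hypothetical and only wraps the gsub call shown in the solution. The default-engine (TRE) call is printed for comparison only, without an expected result:

```r
# Hypothetical helper wrapping the PCRE-based gsub call from the solution
dedupe_repeats <- function(s) {
  gsub("(?s)(.{2,})\\1+", "\\1", s, perl = TRUE)
}

# The four inputs from the question; gsub is vectorized over them
cases <- c("aaa", "aaaa", "abababcdcd", "cdababcdcd")
result <- dedupe_repeats(cases)
print(result)
# expected values: "aaa", "aa", "abcd", "cdabcd"

# Same pattern with the default (TRE) engine, shown for comparison only
print(gsub("(.{2,})\\1+", "\\1", cases))
```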
