Remove duplicates within consecutive runs of characters

Question

I have strings containing lots of duplicates, like this:

tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B", 
     "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:

[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"

I've successfully extracted the duplicated characters (capitals and "*"):

library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"

But I am somewhat stuck here. My hunch is that the function which, which returns indices, might help, but I don't know how to apply it in this case.

Any ideas?

EDIT:

I wasn't that far from the solution myself. Simply using a negative lookahead (instead of the positive lookahead) does the trick:

str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"

[[2]]
[1] "A"

[[3]]
[1] "*" "B"

[[4]]
[1] "A" "*" "A" "*" "A"

[[5]]
[1] "*" "C"

[[6]]
[1] "A"
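
The lookahead above returns a list of single characters rather than the strings shown in the expected result. A minimal sketch of the remaining step, assuming base R's regmatches and gregexpr with perl = TRUE as a stand-in for str_extract_all (the names flat, kept and result are introduced here only for illustration):

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the ">" separators so that runs become adjacent characters
flat <- gsub(">", "", tst, fixed = TRUE)

# Keep every character that is NOT followed by a copy of itself
# (negative lookahead needs perl = TRUE in base R)
kept <- regmatches(flat, gregexpr("(.)(?!\\1)", flat, perl = TRUE))

# Paste each list element back into a single string
result <- vapply(kept, paste, character(1), collapse = "")
result
```

The same paste step should work on the list returned by str_extract_all, since both give one character vector per input string.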

Recommended answer

We can use gsub:

gsub("([A-Z*]>)\\1+", "\\1", tst[1])
#[1] "C>B>C>*>C"

To get the expected result, remove the > as well:

gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst[1]), fixed = TRUE)
#[1] "CBC*C"

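Based on the OP's comments below, maybe remove the > first. This also handles a run at the end of a string, which has no trailing > and which the pattern above leaves uncollapsed (for example, "A>A>A" becomes "A>A"):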

gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
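
As a regex-free cross-check of the result above, the strings can also be split on ">" and collapsed with run-length encoding. This is only a sketch under stated assumptions: collapse_runs is a hypothetical helper name, not part of the original answer, and it relies on base R's strsplit and rle.

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# collapse_runs is a hypothetical helper used only for this sketch
collapse_runs <- function(x, sep = ">") {
  # Split each string into its tokens on the literal separator
  tokens <- strsplit(x, sep, fixed = TRUE)
  # rle() groups consecutive equal tokens; $values keeps one token per run
  vapply(tokens,
         function(tok) paste(rle(tok)$values, collapse = ""),
         character(1))
}

collapse_runs(tst)
```

Because it compares whole tokens rather than single characters, this variant would also cope with tokens longer than one character, at the cost of being more verbose than the gsub one-liner.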
