Remove duplicates within consecutive runs of characters
Problem description
I have strings containing lots of duplicates, like this:
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
"A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:
[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
I've successfully extracted the duplicated capitals:
library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"
but I am somewhat stuck here. My hunch is that the function which, which returns indices, might be of help, but I don't know how to implement it in this case.
Any ideas?
EDIT:
I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:
str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"
[[2]]
[1] "A"
[[3]]
[1] "*" "B"
[[4]]
[1] "A" "*" "A" "*" "A"
[[5]]
[1] "*" "C"
[[6]]
[1] "A"
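If one string per element is wanted rather than a list of single characters, the matches can be pasted back together. The following is a minimal sketch, not part of the original post: it swaps str_extract_all for base R's regmatches()/gregexpr() with perl = TRUE (so that the lookahead is supported without stringr), and the names stripped and collapsed are only illustrative.

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the ">" separators first (fixed = TRUE treats ">" literally).
stripped <- gsub(">", "", tst, fixed = TRUE)

# Keep every character that is NOT followed by a copy of itself,
# i.e. the last character of each run (needs the PCRE engine).
matches <- regmatches(stripped, gregexpr("(.)(?!\\1)", stripped, perl = TRUE))

# Paste each element's matches back into a single string.
collapsed <- sapply(matches, paste, collapse = "")
print(collapsed)
```

On the sample data this should reproduce the expected result shown earlier in the question.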
Recommended answer
We can use gsub:
gsub("([A-Z*]>)\\1+", "\\1", tst[1])
#[1] "C>B>C>*>C"
To get the result without the separators, also remove the >:
gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst[1]), fixed = TRUE)
#[1] "CBC*C"
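Note that this pattern only collapses a unit that ends in >, so a run at the very end of a string, where the last character has no trailing >, is left partly intact. A quick check over the whole vector illustrates this (a sketch only; the names partial and stripped are illustrative and not from the original answer):

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Collapse repeated "X>" units in every element.
partial <- gsub("([A-Z*]>)\\1+", "\\1", tst)
print(partial)
# [1] "C>B>C>*>C" "A>A" "*>B>B" "A>*>A>*>A>A" "*>C>C" "A"

# After stripping the separators, trailing runs still show a duplicate.
stripped <- gsub(">", "", partial, fixed = TRUE)
print(stripped)
# [1] "CBC*C" "AA" "*BB" "A*A*AA" "*CC" "A"
```

Only the first and last elements match the expected result here, which is why removing the separators before collapsing is the safer order.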
Based on the OP's comments below, maybe:
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
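For completeness, the OP's hunch about which can also be made to work without a backreference. The sketch below is an assumption-laden alternative, not the answer's method: collapse_runs is a hypothetical helper that splits each string into characters and keeps only those that differ from their predecessor.

```r
# Hypothetical helper (not from the original answer): collapse runs by
# comparing each character with the one before it.
collapse_runs <- function(x) {
  # Remove the ">" separators, then split each string into single characters.
  chars <- strsplit(gsub(">", "", x, fixed = TRUE), "")
  vapply(chars, function(ch) {
    if (length(ch) == 0) return("")  # guard for empty input strings
    # Indices of the first character and of every character that
    # differs from its predecessor.
    keep <- which(c(TRUE, ch[-1] != ch[-length(ch)]))
    paste(ch[keep], collapse = "")
  }, character(1))
}

tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
result <- collapse_runs(tst)
print(result)
```

The regex version is shorter; the index-based version may be easier to adapt if the run boundaries themselves are needed later.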