Reduce string length by removing contiguous duplicates
Question
I have an R data frame with two fields:
ID WORD
1 AAAAABBBBB
2 ABCAAABBBDDD
3 ...
I'd like to simplify the words by collapsing each run of repeated letters into a single letter:
e.g.: AAAAABBBBB
should give me AB
and ABCAAABBBDDD
should give me ABCABD
Does anyone have an idea how to do this?
Recommended answer
Here's a solution with regex:
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
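The question asks about a data frame column rather than a bare vector, so here is a minimal sketch of the same call applied in place. The data frame name `df` and the new column name `SHORT` are assumptions made for illustration; only the `ID` and `WORD` fields come from the question.

```r
# Hypothetical data frame mirroring the layout in the question (ID, WORD)
df <- data.frame(
  ID   = 1:2,
  WORD = c("AAAAABBBBB", "ABCAAABBBDDD"),
  stringsAsFactors = FALSE
)

# ([A-Za-z]) captures one letter; \\1+ matches one or more further copies
# of that same letter; the replacement \\1 keeps a single copy.
# SHORT is an illustrative column name, not part of the original data.
df$SHORT <- gsub("([A-Za-z])\\1+", "\\1", df$WORD)

print(df)
```

Because `gsub` is vectorised over its input, no explicit loop over rows is needed.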
Edit: By request, some benchmarking. I added Matthew Lundberg's pattern from the comments, which matches any character. It appears that gsub is more than an order of magnitude faster than the strsplit/rle approach (roughly 30 to 45 times by median in the run below), and that matching any character is faster than matching letters only.
library(microbenchmark)
set.seed(1)
## create sample dataset
x <- apply(
  replicate(100, sample(c(LETTERS[1:3], ""), 10, replace = TRUE)),
  2, paste0, collapse = ""
)
## benchmark
xm <- microbenchmark(
  SAPPLY      = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse = '')),
  GSUB.LETTER = gsub("([A-Za-z])\\1+", "\\1", x),
  GSUB.ANY    = gsub("(.)\\1+", "\\1", x)
)
##print results
print(xm)
# Unit: milliseconds
#          expr       min        lq    median        uq       max
# 1    GSUB.ANY  1.433873  1.509215  1.562193  1.664664  3.324195
# 2 GSUB.LETTER  1.940916  2.059521  2.108831  2.227435  3.118152
# 3      SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952
##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()
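As a sanity check on the benchmark, the three expressions can be compared for agreement on the question's inputs. The sketch below also shows where the two regex patterns diverge; the variable names (`by_rle`, `by_letter`, `by_any`) and the test string `y` are made up for illustration.

```r
x <- c("AAAAABBBBB", "ABCAAABBBDDD")

# The three expressions timed in the benchmark
by_rle    <- sapply(strsplit(x, ""), function(s) paste0(rle(s)$values, collapse = ""))
by_letter <- gsub("([A-Za-z])\\1+", "\\1", x)
by_any    <- gsub("(.)\\1+", "\\1", x)

# On letter-only input all three agree
stopifnot(identical(by_rle, by_letter), identical(by_letter, by_any))

# They differ once non-letters appear: "(.)" also collapses repeated digits
# and punctuation, while "([A-Za-z])" leaves those runs untouched.
y <- "AA11--BB"
only_letters <- gsub("([A-Za-z])\\1+", "\\1", y)  # "A11--B"
any_char     <- gsub("(.)\\1+", "\\1", y)         # "A1-B"
print(c(only_letters, any_char))
```

So the faster "any character" pattern is a drop-in replacement only when the words contain nothing but letters, or when collapsing other repeated characters is also wanted.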