Reduce string length by removing contiguous duplicates

Question

I have an R data frame with 2 fields:

ID             WORD
1           AAAAABBBBB
2           ABCAAABBBDDD
3           ...

I'd like to simplify the words by collapsing each run of a repeated letter into a single occurrence of that letter:

e.g.: AAAAABBBBB should give me AB, and ABCAAABBBDDD should give me ABCABD

Does anyone have an idea how to do this?

Recommended answer

Here's a solution with regex:

x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
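To apply the same pattern to the data frame from the question, something like the following should work. This is only a sketch: the data frame name df and the new column SHORT are assumptions for illustration, not part of the original post.

```r
# Hypothetical data frame mirroring the layout in the question
df <- data.frame(
  ID = 1:2,
  WORD = c("AAAAABBBBB", "ABCAAABBBDDD"),
  stringsAsFactors = FALSE
)

# ([A-Za-z]) captures one letter; \\1+ matches one or more repeats of that
# same letter; the replacement \\1 keeps a single copy of it
df$SHORT <- gsub("([A-Za-z])\\1+", "\\1", df$WORD)

print(df)
```

Because gsub is vectorised over its input, the whole column is processed in one call and no explicit loop is needed.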

Edit: by request, some benchmarking. I added the pattern from Matthew Lundberg's comment, which matches any character. On this sample, gsub is more than an order of magnitude faster than the strsplit/rle approach (medians of roughly 1.6 to 2.1 ms versus about 69 ms), and matching any character is somewhat faster than matching only letters.

library(microbenchmark)
set.seed(1)
## create sample dataset: 100 strings of up to 10 characters drawn from A, B, C
x <- apply(
  replicate(100, sample(c(LETTERS[1:3], ""), 10, replace = TRUE)),
  2, paste0, collapse = ""
)
## benchmark the rle approach against the two regex patterns
xm <- microbenchmark(
  SAPPLY = sapply(strsplit(x, ""), function(chars) paste0(rle(chars)$values, collapse = "")),
  GSUB.LETTER = gsub("([A-Za-z])\\1+", "\\1", x),
  GSUB.ANY = gsub("(.)\\1+", "\\1", x)
)
## print results
print(xm)
# Unit: milliseconds
#          expr       min        lq    median        uq       max
# 1    GSUB.ANY  1.433873  1.509215  1.562193  1.664664  3.324195
# 2 GSUB.LETTER  1.940916  2.059521  2.108831  2.227435  3.118152
# 3      SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952

##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()
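
Before comparing timings, it may be worth confirming that the three approaches return the same result. A minimal check on the two example words from the question could look like this (the variable names by_rle, by_letter and by_any are made up for illustration):

```r
x <- c("AAAAABBBBB", "ABCAAABBBDDD")

# Split each string into characters, then keep one value per run
by_rle <- sapply(strsplit(x, ""), function(chars) paste0(rle(chars)$values, collapse = ""))

# Regex versions: letters only, and any character
by_letter <- gsub("([A-Za-z])\\1+", "\\1", x)
by_any <- gsub("(.)\\1+", "\\1", x)

# Stops with an error if the results differ
stopifnot(identical(by_rle, by_letter), identical(by_letter, by_any))
```

The two regex patterns agree here only because the input consists of letters. On input with repeated digits, spaces or punctuation, "(.)\\1+" collapses those runs as well, while "([A-Za-z])\\1+" leaves them untouched.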
