Remove duplicate chars using regex?


Problem description

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -

import re
re.sub("a*", "a", "aaaa") # gives 'a'

What if I want to replace all duplicate chars (i.e. any of a-z) with that respective char? How do I do this?

import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'

NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes

Solution

>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'

The () around the [a-z] specifies a capture group, and the \1 (a backreference) in both the pattern and the replacement refers to the contents of the first capture group.

Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter", and then the entire matched portion is replaced with a single occurrence of that letter.
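
To tie this back to the two inputs from the question, here is a minimal sketch that wraps the same pattern in a helper. The name collapse_runs is hypothetical and only used for illustration. Note that the pattern only collapses adjacent repeats, so a string like 'abab' is left unchanged.

```python
import re


def collapse_runs(text: str) -> str:
    """Collapse each run of a repeated lowercase letter into one letter.

    collapse_runs is a hypothetical helper name, not part of the re module.
    """
    # ([a-z]) captures one letter; \1+ matches one or more further copies of it.
    # The replacement \1 puts back a single copy of the captured letter.
    return re.sub(r"([a-z])\1+", r"\1", text)


print(collapse_runs("aabb"))           # ab
print(collapse_runs("abbccddeeffgg"))  # abcdefg
print(collapse_runs("abab"))           # abab (no adjacent repeats)
```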

On a side note...

Your example code for just a is actually buggy:

>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'

You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
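
The difference can be checked side by side with a short sketch. The exact string produced by 'a*' is deliberately not shown as a comment here, because the handling of empty matches next to a previous match changed in Python 3.7, so newer interpreters insert one more 'a' than the output quoted above. The 'a+' result is the same on every version.

```python
import re

text = "aaabbbccc"

# 'a*' can match the empty string, so it also "matches" at positions
# that contain no 'a' at all, and each such match is replaced with 'a'.
with_star = re.sub(r"a*", "a", text)

# 'a+' needs at least one 'a', so only the real run of a's is replaced.
with_plus = re.sub(r"a+", "a", text)

print(with_star)  # extra a's scattered between the b's and c's
print(with_plus)  # abbbccc
```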
