Fastest way to deduplicate contiguous characters in string - Python

Problem Description

We can deduplicate the contiguous characters in a string with:

def deduplicate(string, char):
    return char.join([substring for substring in string.strip().split(char) if substring])

E.g.

>>> s = 'this is   an   irritating string with  random spacing  .'
>>> deduplicate(s, ' ')
'this is an irritating string with random spacing .'

On the command line there is a squeeze option for tr:

$ tr -s " " < file

Is there a squeeze function in Python's string?

What is the fastest way to deduplicate contiguous characters in a string in Python?

Please note that the character to be deduplicated should be any ASCII/Unicode character and not just \s / whitespace. (It's fine to have 2 sub-answers for ASCII and Unicode.)
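
(As a side note, str itself has no squeeze method. For reference, a general squeeze can be written with itertools.groupby from the standard library; this is a minimal sketch, not one of the approaches benchmarked in the answer below.)

from itertools import groupby

def squeeze(string, char):
    # Collapse each run of `char` to a single occurrence;
    # runs of any other character pass through unchanged.
    return ''.join(k if k == char else ''.join(g) for k, g in groupby(string))

>>> squeeze('this is   an   irritating string with  random spacing  .', ' ')
'this is an irritating string with random spacing .'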

Answer

First of all, your deduplicate function is actually really fast. But some improvements can be made to make it even faster. I have lambdaized your function and called it org_deduplicate (below). Now for some time tests (using IPython's %timeit):

s = 'this is   an   irritating string with  random spacing  .'

org_deduplicate = lambda s,c: c.join([substring for substring in s.strip().split(c) if substring])

%timeit org_deduplicate(s,' ')
100000 loops, best of 3: 3.59 µs per loop

But the strip really isn't necessary and may even give you unexpected results (if you are not deduplicating whitespace), so we can try:

org_deduplicate2 = lambda s,c: c.join(substring for substring in s.split(c) if substring)

%timeit org_deduplicate2(s,' ')
100000 loops, best of 3: 3.4 µs per loop

which speeds things up by a tiny bit, but it's not all that impressive. Let's try a different approach... regular expressions. These are also nice because they give you the flexibility to choose any regular expression as your "character" to deduplicate (not just a single char):

import re

re_deduplicate = lambda s,c: re.sub(r'(%s)(?:\1)+' % c, r'\g<1>', s)  # raw string for \g<1> avoids an invalid-escape warning in Python 3
re_deduplicate2 = lambda s,c: c.join(re.split('%s+'%c,s))

%timeit re_deduplicate(s,' ')
100000 loops, best of 3: 13.8 µs per loop

%timeit re_deduplicate2(s,' ')
100000 loops, best of 3: 6.47 µs per loop
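
To illustrate that flexibility, the same re_deduplicate also collapses a repeated multi-character pattern (the input string here is invented for the example):

>>> re_deduplicate('banananana split', 'na')
'bana split'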

The second one is faster, but neither is even close to your original function. It looks like regular string operations are quicker than re functions. What if we try zipping instead (use itertools.izip if working with Python 2):

zip_deduplicate = lambda s,c: ''.join(s1 for s1,s2 in zip(s,s[1:]) if s1!=c or s1!=s2) + s[-1:]  # + s[-1:] keeps the final character, which zip(s,s[1:]) otherwise drops

%timeit zip_deduplicate(s,' ')
100000 loops, best of 3: 12.9 µs per loop

Still no improvement. The zip method creates too many substrings, which makes ''.join slow. OK, one more try... what about str.replace called recursively:

def rec_deduplicate(s,c):
    # Keep collapsing pairs of c until no doubled c remains.
    if s.find(c*2) != -1:
        return rec_deduplicate(s.replace(c*2, c),c)
    return s

%timeit rec_deduplicate(s,' ')
100000 loops, best of 3: 2.83 µs per loop

Not bad, that seems to be our winner. But just to be sure, let's try it against our original function with a really long input string:

s2 = s*100000

%timeit rec_deduplicate(s2,' ')
10 loops, best of 3: 64.6 ms per loop

%timeit org_deduplicate(s2,' ')
1 loop, best of 3: 209 ms per loop

Yup, it looks like it scales nicely. But let's try one more test: each call of the recursive deduplicator only collapses pairs of the duplicated character, so long runs need multiple passes. Does it still do better with long runs of duplicate chars:

s3 = 'this is                       an                        irritating string with                                  random spacing  .'

%timeit rec_deduplicate(s3,' ')
100000 loops, best of 3: 9.93 µs per loop

%timeit org_deduplicate(s3,' ')
100000 loops, best of 3: 8.99 µs per loop
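
That slowdown is the pass-per-call behavior mentioned above: each s.replace(c*2, c) roughly halves every run, so a run of n repeats needs about log2(n) full scans of the string. A small instrumented sketch (written for illustration, not part of the original timings):

def rec_deduplicate_traced(s, c):
    # Same collapsing logic as rec_deduplicate, but counts the passes.
    passes = 0
    while s.find(c*2) != -1:
        s = s.replace(c*2, c)
        passes += 1
    return s, passes

>>> rec_deduplicate_traced('x' + ' '*32 + 'y', ' ')
('x y', 5)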

It does lose some of its advantage when there are long strings of repeated characters to remove.

In summary, use your original function (with a few tweaks maybe) if your strings will have long substrings of repeating characters. Otherwise, the recursive version is fastest.
