高效查找字符串中的重复字符 [英] Efficiently find repeated characters in a string

查看:210
本文介绍了高效查找字符串中的重复字符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我知道这段代码的效率不是最优的(尤其是对于巨大的输入),而且我知道有一种方法可以改变这个算法来处理其他数据类型,而不仅仅是字符串中的重复(显然有只能搜索这么多字符).

I know that the efficiency of this code is not optimal (esp. with gigantic inputs), and I know that there is a way to change this algorithm to handle other data types and not just a repetition in a string (obviously there are only so many characters to search through).

有什么办法可以提高效率吗?

Is there any way I can increase efficiency here?

我尝试使用字典,但函数一直返回none",所以我尝试了一个列表,结果一切正常.

I tried using a dictionary and the function kept returning 'none' so I tried a list and things worked out fine.

提前感谢任何可以帮助我的人!

Thanks ahead of time to anyone who can help me out!

def find_repeater(string):
    my_list = []
    my_list.append(string[0])

    for i in range (1, len(string)):

        if string[i] in my_list:
            print 'repetition found'
            return (string[i])

        else:
            my_list.append(string[i])

print find_repeater('abca')  

现在有了字典......(它一直在控制台上打印none")

now with a dictionary....(it keeps printing 'none' to the console)

def find_repeater(string):
    my_dict = {}
    my_dict[0] = string[0]

    for i in range (1, len(string)):

        if string[i] in my_dict:
            print 'repetition found'
            return string[i]

        else:
            my_dict[i] = string[i]

print find_repeater('abca')  

推荐答案

由于这是一个性能问题,让我们做一些时间安排:

As this is a performance question, let's do some timings:

def test_set(xs):
    seen = set()  # O(1) lookups
    for x in xs:
        if x not in seen:
            seen.add(x)
        else:
            return x

import collections

def test_counter(xs):
    freq = collections.Counter(xs)
    for k in freq:
        if freq[k] > 1:
            return k

def test_dict(xs):
    d = {}
    for x in xs:
        if x in d:
            return x
        d[x] = 1

def test_sort(xs):
    ys = sorted(xs)

    for n in range(1, len(xs)):
        if ys[n] == ys[n-1]:
            return ys[n]

##

import sys, timeit
print (sys.version + "\n")
xs = list(range(10000)) + [999]
fns = [p for name, p in globals().items() if name.startswith('test')]
for fn in fns:
    assert fn(xs) == 999
    print ('%50s %.5f' % (fn, timeit.timeit(lambda: fn(xs), number=100)))

我正在测试一个整数列表而不是一个字符串(因为使用一个字符串你不能得到超过 256 个循环).我机器上的结果是这样的:

I'm testing on an list of integers rather than a string (because with a string you can't get more than 256 loops). The results on my machine look like this:

3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

                <function test_set at 0x1020f7380> 0.19265
               <function test_dict at 0x1020f7490> 0.12725
               <function test_sort at 0x1020f7518> 0.04683
            <function test_counter at 0x1020f7408> 0.92485

所以排序方法似乎是赢家.我想这是因为它不会浪费时间创建哈希和分配 dict/set 结构.此外,如果您不关心正在更改的源列表,您可以执行 xs.sort() 而不是 ys = sorted(xs),这会给您零内存占用.

So the sort method appears to be the winner. I guess this is because it doesn't waste time creating hashes and allocating dict/set structures. Also, if you don't care about the source list being changed, you can do xs.sort() instead of ys = sorted(xs), which gives you zero memory footprint.

另一方面,如果重复项更有可能出现在输入的开头(如xs = 'abcdef' * 10000),则set方法将执行最佳,因为它与 sortCounter 不同,一旦找到重复项就立即返回并且不需要预处理整个列表.如果您需要 first 重复元素,而不仅仅是其中一个,您还应该使用 set.

On the other side, if repeated items are more probable to occur towards the beginning of the input (as in xs = 'abcdef' * 10000), the set method will perform the best, as it, unlike sort or Counter, returns immediately once a repeat is found and doesn't need to preprocess the whole list. You should also use set if you need the first repeating element, not just one of them.

Counter 是一个不错的工具,但它不是为性能而设计的,所以如果你真的必须处理巨大的输入",请使用集合(如果它们适合内存)或归并排序(如果它们)不要.

Counter is a nice tool, but it's not designed for performance, so if you really have to deal with "gigantic inputs", go with sets (if they fit in memory) or mergesort if they don't.

这篇关于高效查找字符串中的重复字符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆