What is the fastest way to check if a string contains repeating characters in Python 3?

Problem Description

I need to filter strings by the criterion that they contain no character twice.

  • The strings are many (say 1.4 trillion).
  • The strings are short (around 8 characters).
  • The strings are unique (caching won't work).
  • The strings have a big character set (say any Unicode character).
  • The strings usually meet the criterion (say 2/3 have no repeating characters).

The code in use looks like this:

>>> candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]
>>> result_strings = [s for s in candidate_strings if unique_chars(s)]
>>> print(result_strings)
["barfnehg", "bazfnehg"]

I implemented a naive version, simply iterating the string:

def unique_chars_naive(string_given):
    """
    Checks if a given string contains only unique characters.
    This version iterates over the given string, remembering every character seen so far.
    """
    chars_seen = []
    for char in string_given:
        if char in chars_seen:
            return False
        chars_seen.append(char)
    return True

My next-best idea was to use a set, so I implemented that:

def unique_chars_set(string_given):
    """
    Checks if a given string contains only unique characters.
    This version exploits the fact that a set contains only unique entries.
    """
    return len(string_given) == len(set(string_given))

I saved the functions to a file UniqueCharacters.py and timed them:

$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_naive(s) for s in candidate_strings]'
100000 loops, best of 3: 20.3 usec per loop

$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_set(s) for s in candidate_strings]'
100000 loops, best of 3: 17.7 usec per loop

This shows that the unique_chars_set is faster by about 15 % for this dataset.

Is there a faster way to do this? With regular expressions maybe? Is there some method in the standard library that does this?
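
For reference, the regular-expression variant I have in mind might look something like this (just a sketch with a pattern and function name of my own; I have not measured it):

import re

# A repeated character shows up as a backreference match: capture any
# character, allow anything in between, then require the same character again.
REPEAT_RE = re.compile(r'(.).*?\1', re.DOTALL)

def unique_chars_regex(s):
    """Return True if no character in s occurs twice."""
    return REPEAT_RE.search(s) is None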

Recommended Answer

Let me start off by saying that I suspect that you are optimizing when you don't need to. Python is a high-level language that supports thinking about computation in a high-level manner. A solution that is readable, elegant, and reusable is often going to be better than one that is blazingly fast, but hard to understand.

When, and only when, you determine that speed is an issue should you proceed with optimizations. Perhaps even write a C extension for the computationally intense parts.

That being said, here's a comparison of a few techniques:

from collections import Counter

def unique_chars_set(s):
    return len(s) == len(set(s))

def unique_chars_frozenset(s):
    return len(s) == len(frozenset(s))

def unique_chars_counter(s):
    # The most common character must occur exactly once.
    return Counter(s).most_common(1)[0][1] == 1

def unique_chars_sort(s):
    ss = ''.join(sorted(s))
    prev = ''
    for c in ss:
        if c == prev:
            return False
        prev = c
    return True

def unique_chars_bucket(s):
    # Assumes code points below 256 (e.g. Latin-1); covering all of Unicode
    # would need 0x110000 buckets.
    buckets = 256 * [False]
    for c in s:
        o = ord(c)
        if buckets[o]:
            return False
        buckets[o] = True
    return True
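
The timings below assume the three sample strings from the question are defined in the session; the quick sanity check is my own addition, just to confirm the functions return the expected results:

candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]

# Only "foobnehg" contains a repeated character ('o' occurs twice).
assert [unique_chars_set(s) for s in candidate_strings] == [False, True, True]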

And here is the performance comparison (in IPython):

In [0]: %timeit -r10 [unique_chars_set(s) for s in candidate_strings]
100000 loops, best of 10: 6.63 us per loop

In [1]: %timeit -r10 [unique_chars_frozenset(s) for s in candidate_strings]
100000 loops, best of 10: 6.81 us per loop

In [2]: %timeit -r10 [unique_chars_counter(s) for s in candidate_strings]
10000 loops, best of 10: 83.1 us per loop

In [3]: %timeit -r10 [unique_chars_sort(s) for s in candidate_strings]
100000 loops, best of 10: 13.1 us per loop

In [4]: %timeit -r10 [unique_chars_bucket(s) for s in candidate_strings]
100000 loops, best of 10: 15 us per loop

Conclusion: set is elegant and faster than many other obvious methods. But the differences are so small that it hardly matters anyway.

For more benchmarks, see @FrancisAvila's answer.
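
Given the scale mentioned in the question (on the order of 1.4 trillion short strings), whichever check wins, you would probably apply it lazily rather than build a full list of results in memory. A minimal sketch using the set-based check (the filter_unique wrapper is my own name, not from either post):

def filter_unique(strings):
    """Lazily yield only the strings that contain no repeated character."""
    return (s for s in strings if len(s) == len(set(s)))

# With the sample data from the question:
print(list(filter_unique(["foobnehg", "barfnehg", "bazfnehg"])))
# ['barfnehg', 'bazfnehg']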
