字符串切片的时间复杂度 [英] Time complexity of string slice

查看:120
本文介绍了字符串切片的时间复杂度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

切片 Python 字符串的时间复杂度是多少?鉴于 Python 字符串是不可变的,我可以想象将它们切片为 O(1)O(n) 取决于切片的实现方式.

What's the time complexity of slicing a Python string? Given that Python strings are immutable, I can imagine slicing them being either O(1) or O(n) depending upon how slicing is implemented.

我需要编写一个函数来遍历(可能很大)字符串的所有后缀.我可以通过将后缀表示为整个字符串的元组加上一个索引来开始读取字符来避免对字符串进行切片,但这很丑陋.如果我天真地写我的函数是这样的:

I need to write a function that iterates over all suffixes of a (potentially big) string. I could avoid slicing the string by representing a suffix as a tuple of the entire string plus an index to start reading characters from, but that's ugly. If instead I naively write my function like this:

def do_something_on_all_suffixes(big_string):
    for i in range(len(big_string)):
        suffix = big_string[i:]
        some_constant_time_operation(suffix)

...它的时间复杂度是O(n)还是O(n2),其中nlen(big_string)?

... will its time complexity be O(n) or O(n2), where n is len(big_string)?

推荐答案

简短回答:str 切片,一般来说,复制.这意味着为每个字符串的 n 后缀做切片的函数正在执行 O(n2) 工作.也就是说,如果您可以使用 bytes 的对象,则可以避免复制="noreferrer">memoryviews 以获取原始字节数据的零拷贝视图.请参阅下面的如何进行零拷贝切片,了解如何使其工作.

Short answer: str slices, in general, copy. That means that your function that does a slice for each of your string's n suffixes is doing O(n2) work. That said, you can avoid copies if you can work with bytes-like objects using memoryviews to get zero-copy views of the original bytes data. See How you can do zero copy slicing below for how to make it work.

长答案:(C)Python str 不要通过引用数据子集的视图来切片.str 切片正好有三种操作模式:

Long answer: (C)Python str do not slice by referencing a view of a subset of the data. There are exactly three modes of operation for str slicing:

  1. 完整切片,例如mystr[:]:返回对完全相同的str 的引用(不仅仅是共享数据,相同的实际对象,mystr 是 mystr[:]> 因为 str 是不可变的,所以这样做没有风险)
  2. 零长度切片和(取决于实现)缓存长度为 1 的切片;空字符串是单例(mystr[1:1] is mystr[2:2] is ''),长度为 1 的低序数字符串也是缓存的单例(在 CPython 3.5.0 上),看起来所有可以用 latin-1 表示的字符,即 range(256) 中的 Unicode 序数,都被缓存了)
  3. 所有其他切片:切片的 str 在创建时被复制,此后与原始 str
  4. 无关
  1. Complete slice, e.g. mystr[:]: Returns a reference to the exact same str (not just shared data, the same actual object, mystr is mystr[:] since str is immutable so there is no risk to doing so)
  2. The zero length slice and (implementation dependent) cached length 1 slices; the empty string is a singleton (mystr[1:1] is mystr[2:2] is ''), and low ordinal strings of length one are cached singletons as well (on CPython 3.5.0, it looks like all characters representable in latin-1, that is Unicode ordinals in range(256), are cached)
  3. All other slices: The sliced str is copied at creation time, and thereafter unrelated to the original str

#3 是一般规则的原因是为了避免将大 str 保存在内存中的问题,因为它的一小部分.如果你有一个 1GB 的文件,读入它并像这样切片(是的,当你可以搜索时这是浪费,这是为了说明):

The reason why #3 is the general rule is to avoid issues with large str being kept in memory by a view of a small portion of it. If you had a 1GB file, read it in and sliced it like so (yes, it's wasteful when you can seek, this is for illustration):

with open(myfile) as f:
    data = f.read()[-1024:]

那么您将有 1 GB 的数据保存在内存中以支持显示最后 1 KB 的视图,这是一种严重的浪费.由于切片通常很小,因此在切片上复制而不是创建视图几乎总是更快.这也意味着 str 可以更简单;它需要知道它的大小,但它也不需要跟踪数据中的偏移量.

then you'd have 1 GB of data being held in memory to support a view that shows the final 1 KB, a serious waste. Since slices are usually smallish, it's almost always faster to copy on slice instead of create views. It also means str can be simpler; it needs to know its size, but it need not track an offset into the data as well.

的方法可以在 Python 中执行基于视图的切片,而在 Python 2 中,它将适用于 str(因为 str 是字节- 就像在 Python 2 中一样,支持 缓冲区协议).使用 Py2 str 和 Py3 bytes(以及许多其他数据类型,如 bytearrayarray.arraynumpy 数组、mmap.mmaps 等),您可以创建一个 memoryview 即原对象的零拷贝视图,可以切片不拷贝数据.因此,如果您可以使用(或编码)到 Py2 str/Py3 bytes,并且您的函数可以使用任意 bytes-like 对象,那么你可以这样做:

There are ways to perform view based slicing in Python, and in Python 2, it will work on str (because str is bytes-like in Python 2, supporting the buffer protocol). With Py2 str and Py3 bytes (as well as many other data types like bytearray, array.array, numpy arrays, mmap.mmaps, etc.), you can create a memoryview that is a zero copy view of the original object, and can be sliced without copying data. So if you can use (or encode) to Py2 str/Py3 bytes, and your function can work with arbitrary bytes-like objects, then you could do:

def do_something_on_all_suffixes(big_string):
    # In Py3, may need to encode as latin-1 or the like
    remaining_suffix = memoryview(big_string)
    # Rather than explicit loop, just replace view with one shorter view
    # on each loop
    while remaining_suffix:  # Stop when we've sliced to empty view
        some_constant_time_operation(remaining_suffix)
        remaining_suffix = remaining_suffix[1:]

memoryview 的切片确实创建了新的视图对象(它们只是具有固定大小的超轻量级,与它们查看的数据量无关),只是没有任何数据,所以 some_constant_time_operation 可以根据需要存储一个副本,以后我们将其切片时不会更改它.如果您需要一个正确的副本作为 Py2 str/Py3 bytes,您可以调用 .tobytes() 来获取原始 bytes obj,或(在 Py3 中仅出现),将其直接解码为从缓冲区复制的 str,例如str(remaining_suffix[10:20], 'latin-1').

The slices of memoryviews do make new view objects (they're just ultra-lightweight with fixed size unrelated to the amount of data they view), just not any data, so some_constant_time_operation can store a copy if needed and it won't be changed when we slice it down later. Should you need a proper copy as Py2 str/Py3 bytes, you can call .tobytes() to get the raw bytes obj, or (in Py3 only it appears), decode it directly to a str that copies from the buffer, e.g. str(remaining_suffix[10:20], 'latin-1').

这篇关于字符串切片的时间复杂度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆