Python 3 string index lookup is O(1)?


Question

Short story:

Is Python 3 Unicode string index lookup O(1) or O(n)?

Long story:

Index lookup of a character in a C char array is constant time O(1) because we can with certainty jump to a contiguous memory location:

const char* mystring = "abcdef";
char its_d = mystring[3];

It's the same as saying:

char its_d = *(mystring + 3);

Because we know that sizeof(char) is 1 per C99, and because with ASCII one character fits in one byte.

Now, in Python 3, where string literals are Unicode strings, we have the following:

>>> mystring = 'ab€cd'
>>> len(mystring)
5
>>> mybytes = mystring.encode('utf-8')
>>> len(mybytes)
7
>>> mybytes
b'ab\xe2\x82\xaccd'
>>> mystring[2]
'€'
>>> mybytes[2]
226
>>> ord(mystring[2])
8364

Since the bytes are UTF-8 encoded, the byte at index 2 is > 127, which means the third character uses a multibyte representation.

I can only conclude that an index lookup in a Python string CANNOT be O(1), because of the multibyte representation of characters? That would mean that mystring[2] is O(n), and that somehow an on-the-fly interpretation of the memory array is being performed in order to find the character at that index? If that's the case, did I miss some relevant documentation stating this?

I ran a very basic benchmark, but I cannot infer O(n) behaviour from it: https://gist.github.com/carlos-jenkins/e3084a07402ccc25dfd0038c9fe284b5

$ python3 lookups.py
Allocating memory...
Go!
String lookup: 0.513942 ms
Bytes lookup : 0.486462 ms
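The linked script is not reproduced here, so the following is only a rough sketch of one way to probe for O(n) behaviour: time a lookup near the start of a large non-ASCII string and one near the end. The helper name `time_lookup` and the string size are assumptions chosen for illustration. If indexing scanned from the beginning, the second lookup would be expected to be dramatically slower.

```python
import timeit


def time_lookup(text: str, index: int, repeats: int = 200_000) -> float:
    """Return the average time in seconds of one text[index] lookup."""
    total = timeit.timeit(lambda: text[index], number=repeats)
    return total / repeats


# About 5 million code points, mixing ASCII with a character that needs
# three bytes in UTF-8.
big = 'ab€cd' * 1_000_000

near_start = time_lookup(big, 2)
near_end = time_lookup(big, len(big) - 3)

print(f"lookup near start: {near_start * 1e9:.1f} ns")
print(f"lookup near end  : {near_end * 1e9:.1f} ns")
```

Absolute numbers vary by machine and include the overhead of the lambda call, so only the comparison between the two timings is meaningful.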

EDIT: Updated with a better example.

Answer

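UTF-8 is the default source encoding for Python 3, but that only governs how source files are decoded. The internal representation of a string uses fixed-size per-character elements in both Python 2 and Python 3. In CPython 3.3 and later (PEP 393), each string is stored with 1, 2, or 4 bytes per code point, depending on the widest character it contains. One consequence is that accessing characters in Python (Unicode) string objects by index has O(1) cost.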

The code and results you presented do not demonstrate otherwise. You convert a string to a UTF-8-encoded byte sequence, and we all know that UTF-8 uses variable-length code sequences, but none of that says anything about the internal representation of the original string.
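The fixed-width storage can be observed indirectly. The sketch below assumes CPython 3.3 or later and relies on `sys.getsizeof`, whose results are implementation-specific. The helper name `bytes_per_char` is made up for this example. It measures how much a string grows when its length doubles, which cancels out the fixed object header.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate the storage used per code point for strings made of `ch`.

    Assumes CPython: compares the sizes of two strings of different length
    so that the constant object header drops out of the difference.
    """
    shorter = sys.getsizeof(ch * n)
    longer = sys.getsizeof(ch * (2 * n))
    return (longer - shorter) // n


for ch in ('a', '€', '😀'):
    print(repr(ch), bytes_per_char(ch))
```

On CPython builds that follow PEP 393 you would expect 1, 2 and 4 for these three characters. Whatever the width, it is the same for every element within a given string, so `mystring[i]` is a direct offset calculation and no scan is needed.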
