Python 和 Perl 中如何实现基本数据类型(字符串和整数) [英] How are basic data types (strings and integers) implemented in Python and Perl

查看:54
本文介绍了Python 和 Perl 中如何实现基本数据类型(字符串和整数)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近一直想知道我对字符串和整数等基本类型执行的各种操作如何在性能方面发挥作用,我认为如果我知道这些基本类型是如何实现的(即我听说字符串和整数在 Python 中是不可变的.这是否意味着任何修改字符串中一个字符的操作都是 O(n),因为必须创建一个全新的字符串?添加数字怎么样?)

我对 Python 和 Perl 中的这一点很好奇,并且问两次基本相同的问题感到很傻,所以我只是把它包装成一个.

如果您能在答案中包含一些示例运营成本,那会更有帮助.

解决方案

在 python 中,some_string[5] = 'a' 会出错,但最接近的等效操作,some_string= some_string[5:] + 'a' + some_string[6:] 确实是 O(n).但这不仅适用于不可变对象.连接列表也是如此:[1,2,3] + [4,5,6] 生成一个新列表并且是 O(n).

添加数字会创建一个新值,但通常结果值在内存中总是相同的大小,所以它是 O(1).当然,这只适用于小整数.一旦达到某个阈值(在我的机器上为 20 位数字),整数会突然占用可变数量的空间.我不知道这如何影响渐近性能.

然而,我发现即使在 log10(n) == 1000 附近,它似乎也没有显着影响:

<预><代码>>>>时间 = [timeit.timeit(stmt=stmt.format(10 ** i, 10 ** i), number=100) for i in range(1000)]>>>总和(次) * 1.0/len(次)3.0851364135742186e-06>>>次[-1]3.0994415283203125e-06

对于字符串,渐近性能的影响更加明显:

<预><代码>>>>stmt = 's[:5] + "a" + s[6:]'>>>设置 = 's = "b" * {0}'>>>时间 = [timeit.timeit(stmt=stmt, setup=setup.format(i), number=10) for i in range(100000)]>>>总和(次) * 1.0/len(次)6.2434492111206052e-05>>>次[-1]0.0001220703125

最后一个操作的执行时间远低于平均值.而且趋势相当稳定:

<预><代码>>>>对于 t 时间 [0:100000:10000]:... 打印 t...5.00679016113e-061.31130218506e-052.90870666504e-053.88622283936e-055.10215759277e-056.19888305664e-057.41481781006e-058.48770141602e-059.60826873779e-050.000108957290649

不过,像这样在小字符串上的操作还是很便宜的.

<小时>

为了扩展您的其他问题,列表和字符串的索引访问都是 O(1).

<预><代码>>>>stmt = 'x = s[{0}] + s[{1}] + s[{2}]'>>>设置 = 's = "a" * {0}'>>>时间 = [timeit.timeit(stmt=stmt.format(i/2, i/3, i/4), setup=setup.format(i + 1), number=10) for i in range(1000000)]>>>总和(次) * 1.0/len(次)3.6441037654876707e-06>>>次[-1]3.0994415283203125e-06

列表也一样:

<预><代码>>>>stmt = 'x = s[{0}] + s[{1}] + s[{2}]'>>>设置 = 's = ["a"] * {0}'>>>时间 = [timeit.timeit(stmt=stmt.format(i/2, i/3, i/4), setup=setup.format(i + 1), number=10) for i in range(100000)]>>>总和(次) * 1.0/len(次)2.8617620468139648e-06>>>次[-1]1.9073486328125e-06

切片复制字符串列表,因此是O(n),n == len(slice).没有好"的方法可以替换字符串中的一个字母,尽管我想强调坏"的方法在大多数情况下已经足够好了.如果你想要一个好"的方式,使用不同的数据类型;操作列表并在需要字符串时加入它;或使用 StringIO 对象.此页面有一些关于连接不同内置 Python 数据类型的有用信息.

最后,既然你真的对内部结构感兴趣,我在 stringobject.h(从 2.7 版开始;3+ 可能看起来不同).这是关于你所期望的——一个带有一些额外花里胡哨的 c 字符串:

typedef struct {PyObject_VAR_HEAD

(PyObject_VAR_HEAD 是 ac 预处理器宏,根据 此处.)

 Py_ssize_t ob_refcnt;PyTypeObject *ob_type;py_ssize_t ob_size;

继续...

 long ob_shash;int ob_sstate;字符 ob_sval[1];/* 不变量:* ob_sval 包含 'ob_size+1' 元素的空间.* ob_sval[ob_size] == 0.* ob_shash 是字符串的哈希值,如果尚未计算,则为 -1.* ob_sstate != 0 如果字符串对象在 stringobject.c 中*实习"字典;在这种情况下,两个参考* 从 'interned' 到这个对象在 ob_refcnt 中 * 不计算 *.*/pyStringObject;

列表具有类似的结构 --c 数组有额外的花里胡哨——但不是空终止的,通常有额外的预分配存储空间.

毋庸置疑……其中大部分内容适用于 cPython -- PyPyIronPythonJython 可能看起来完全不同!

I've been wondering lately how various operations I perform on basic types like strings and integers work in terms of performance, and I figure I could get a much better idea of this if I knew how those basic types were implemented (i.e. I've heard strings and integers are immutable in Python. Does that mean any operation that modifies one character in a string is O(n) because a completely new string has to be created? How about adding numbers?)

I'm curious about this in both Python and Perl, and felt silly asking basically the same question twice, so I'm just wrapping it into one.

If you can include some example operation costs with your answer, that would make it even more helpful.

解决方案

In python, some_string[5] = 'a' would be an error, but the closest equivalent operation, some_string = some_string[5:] + 'a' + some_string[6:] would indeed be O(n). But that's not just true of immutable objects. The same is true for concatenating lists: [1,2,3] + [4,5,6] generates a new list and is O(n).

Adding numbers creates a new value, but generally the resulting value is always the same size in memory, so it's O(1). Of course that only holds with small ints. Once you hit some threshold (20 digits on my machine), suddenly ints take up a variable amount of space. I don't know how that effects asymptotic performance.

However, I found that it doesn't seem to have even a significant effect even near log10(n) == 1000:

>>> times = [timeit.timeit(stmt=stmt.format(10 ** i, 10 ** i), number=100) for i in range(1000)]
>>> sum(times) * 1.0 / len(times)
3.0851364135742186e-06
>>> times[-1]
3.0994415283203125e-06

For strings, the asymptotic performance hit is more immediately apparent:

>>> stmt = 's[:5] + "a" + s[6:]'
>>> setup = 's = "b" * {0}'
>>> times = [timeit.timeit(stmt=stmt, setup=setup.format(i), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
6.2434492111206052e-05
>>> times[-1]
0.0001220703125

The execution time for the last operation is well below the average. And the trend is pretty steady:

>>> for t in times[0:100000:10000]:
...     print t
... 
5.00679016113e-06
1.31130218506e-05
2.90870666504e-05
3.88622283936e-05
5.10215759277e-05
6.19888305664e-05
7.41481781006e-05
8.48770141602e-05
9.60826873779e-05
0.000108957290649

Still, operations like these on small strings are pretty cheap.


To expand on your other questions, indexed access is O(1) on both lists and strings.

>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = "a" * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(1000000)]
>>> sum(times) * 1.0 / len(times)
3.6441037654876707e-06
>>> times[-1]
3.0994415283203125e-06

Likewise with lists:

>>> stmt = 'x = s[{0}] + s[{1}] + s[{2}]'
>>> setup = 's = ["a"] * {0}'
>>> times = [timeit.timeit(stmt=stmt.format(i / 2, i / 3, i / 4), setup=setup.format(i + 1), number=10) for i in range(100000)]
>>> sum(times) * 1.0 / len(times)
2.8617620468139648e-06
>>> times[-1]
1.9073486328125e-06

Slicing copies both strings and lists, and is therefore O(n) with n == len(slice). There is no "good" way to replace one letter of a string, although I want to emphasize that the "bad" way is good enough most of the time. If you want a "good" way, use a different data type; manipulate a list and join it when a string is required; or use a StringIO object. This page has some useful information on concatenating different built-in Python datatypes.

Finally, since you're really interested in the internals, I dug up the struct declaration of PyStringObject in stringobject.h (from version 2.7; 3+ probably looks different). It's about what you'd expect -- a c string with some extra bells and whistles:

typedef struct {
    PyObject_VAR_HEAD

(PyObject_VAR_HEAD is a c preprocessor macro that expands to something like the below depending on rules explained here.)

    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
    Py_ssize_t ob_size;

Continuing...

    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

Lists have a similar structure -- c arrays with extra bells and whistles -- but aren't null terminated and generally have extra preallocated storage space.

Needless to say... much of this this applies only to cPython -- PyPy, IronPython, and Jython probably all look totally different!

这篇关于Python 和 Perl 中如何实现基本数据类型(字符串和整数)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆