python位数组(高性能) [英] python bit array (performant)

查看:55
本文介绍了python位数组(高性能)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在设计一个布隆过滤器,我想知道 Python 中性能最高的位数组实现是什么.

I'm designing a bloom filter and I'm wondering what the most performant bit array implementation is in Python.

Python 的好处是它可以开箱即用地处理任意长度的整数,这就是我现在使用的,但我对 Python 内部结构的了解不够,不知道这是否是在 Python 中执行此操作的最佳方法.

The nice thing about Python is that it can handle arbitrary length integers out of the box and that's what I use now, but I don't know enough about Python internals to know if that's the most performant way to do it in Python.

我发现了 bitarray 但它处理了很多其他事情,比如切片,我不知道不需要.我只需要 &|<< 操作.

I found bitarray but it handles a lot of other things like slicing, which I don't need. I only need the & and | and << operations.

推荐答案

内置的 int 优化得相当不错,已经支持 &, |<<<.

The built-in int is pretty nicely optimized, and it already supports &, |, and <<.

至少有一种基于 GMP 的任意长度整数的替代实现,称为 gmpy2.(还有原始的 gmpyPyGMPSophie 和其他一些围绕同一个库的包装器,但我怀疑他们是否有任何实际性能差异.)

There's at least one alternative implementation of arbitrary-length integers, based on GMP, known as gmpy2. (There's also the original gmpy, PyGMP, Sophie, and a few other wrappers around the same library, but I doubt they'd have any real performance differences.)

位数组"思想有两个主要实现,bitarray(您链接的那个)和 bitstring,如以及一些像 intbitset 这样的库,它们可以为您提供取而代之的是类似 set 的界面(这也应该适用于您的用途).

And there are two major implementations of the "bit array" idea, bitarray (the one you linked) and bitstring, as well as a few libraries like intbitset that give you a set-like interface instead (which should also work for your uses).

所以,让我们把它们放在一起比较:

So, let's toss them all together and compare:

import random
import struct
import timeit
import bitarray
import bitstring
import gmpy2

n = random.randrange((1<<31)+1, 1<<32)

bs = bitstring.pack('<q', n)
ba = bitarray.bitarray(64)
ba.frombytes(struct.pack('<q', n))
gm = gmpy2.mpz(n)
py = n

for x in 'bs', 'ba', 'gm', 'py':
    def f(x=locals()[x]): x | x; x & x
    t = timeit.timeit(f, number=10000)
    print(x, t)

在我的 Mac 上,运行 Python.org 64 位 CPython 3.3.2,这是我得到的:

On my Mac, running Python.org 64-bit CPython 3.3.2, here's what I get:

bs 0.7623525890521705
ba 0.006623028079047799
gm 0.0016346259508281946
py 0.002280334010720253

这似乎足以立即驳回bitstring.

另外,虽然我没有在这里展示 intbitset,因为它和我发现的任何类似库都不适用于 Python 3,通过各种比较(使用 intbitset.intbitset([i for i, bit in enumerate(bin(n)[2:]) if bit != '0'])) 它比 int 慢 14 到 70 倍测试我扔了它,所以我也草率地驳回了它.

Also, while I didn't show intbitset here, because neither it nor any similar libraries I found work with Python 3, from a variety of comparisons (using intbitset.intbitset([i for i, bit in enumerate(bin(n)[2:]) if bit != '0'])) it's anywhere from 14 to 70 times slower than int in every test I throw at it, so I've summarily dismissed it as well.

所以让我们用更多的代表尝试其他三个:

So let's try the other three with more reps:

ba 6.385123810963705
gm 1.5937359740491956
py 2.129726824001409

而且数字保持不变.bitarray 远不如内置的 int 快.使用起来也比较笨拙(请注意,应该是替代构造函数"的类方法是实例方法,没有快速简便的方法可以从 int 转换或转换为 int,这也是我只测试 x | x 的原因x & x 是对运算符有限制).如果您需要一个整数作为位数组,那就太好了;如果您需要对整数进行 C 风格的位处理,这不是它最擅长的.

And the numbers hold up. bitarray is nowhere near as fast as built-in int. It's also more clumsy to use (note that what should be an "alternate constructor" classmethod is an instance method, there's no fast and easy way to convert from or to an int, and the reason I was only testing x | x and x & x is that there are limitations on the operators). If you need an integer as an array of bits, it's great; if you need to do C-style bit-munging on an integer, that's not what it's best at.

至于 gmpy2,它似乎比原生的 int 快三分之一左右.如果我们把数字放大很多,比如 1.5kbits 会怎样?

As for gmpy2, it seems to be about a third faster than the native int. What if we make the numbers a whole lot bigger, like 1.5kbits?

gm 0.19562570203561336
py 0.29293217696249485

这是 Apple Python 2.7.5 的数字:

And here are the numbers with Apple Python 2.7.5:

('gm', 0.2890629768371582)
('py', 0.36592698097229004)

那么,值得吗?它使用起来不太友好,在您没有询问的其他一些操作上它更慢而不是更快,它需要一个 LGPL 许可的第三方 C 库,它在内存溢出情况下的错误处理行为要差得多(就像,您的应用程序可能会出现段错误而不是提升),并且 StackOverflow 上至少有一个人会出现并告诉您,每当提到 GMP 时都很糟糕(我相信是因为旧版本中的错误).

So, is it worth it? It's less friendly to use, it's slower rather than faster on some other operations that you didn't ask about, it requires a third-party C library which is LGPL-licensed, it has much worse error-handling behavior in memory overflow cases (as in, your app may segfault instead of raising), and there's at least one person on StackOverflow who is going to show up and tell you that GMP sucks whenever it gets mentioned (I believe because of a bug in an older version).

但如果您需要额外的速度,也许值得.

But if you need that extra speed, maybe it's worth it.

另一方面,这里是 PyPy、3.2.3/2.1.0b1 和 PyPy 2.7.3/2.1.0,使用原生 int 类型——与来自 g 的 0.19562570203561336 和 0.28906297822 结果进行比较以上:

On the other hand, here's PyPy, 3.2.3/2.1.0b1 and PyPy 2.7.3/2.1.0, using the native int type—compare with the 0.19562570203561336 and 0.2890629768371582 results from gmpy2 above:

py 0.2135779857635498
('py', 0.20878291130065918)

因此,从 CPython 切换到 PyPy 给您带来的好处几乎与从 Python 3 中的 int 切换到 gmpy2.mpz 一样多,而且在 Python 2 中的好处要多得多.它几乎肯定也会加速您的其余代码.

So, switching from CPython to PyPy gives you almost as much benefit as switching from int to gmpy2.mpz in Python 3, and significantly more benefit in Python 2. And it will almost certainly accelerate the rest of your code as well.

这篇关于python位数组(高性能)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆