Most efficient property to hash for numpy array
Question
I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
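To make the caching idea concrete, here is a minimal sketch of a dict cache keyed on the array's raw bytes. It uses `tobytes()`, the current name for `tostring()` (which newer NumPy releases deprecated and later removed). The names `array_key`, `cached_lookup`, `_cache`, and `expensive` are hypothetical, made up for this illustration. Including the shape and dtype in the key is an added assumption: it keeps arrays with the same bytes but a different layout from colliding, and can be dropped if all arrays are known to share one shape and dtype.

```python
import numpy as np

_cache = {}  # hypothetical cache: key -> computed result


def array_key(arr):
    # bytes objects are hashable and compare by value, so equal arrays
    # produce equal keys regardless of object identity.
    # Shape and dtype are included (an assumption of this sketch) because
    # the raw bytes alone do not distinguish e.g. a (2, 3) from a (3, 2) array.
    return (arr.shape, arr.dtype.str, arr.tobytes())


def cached_lookup(arr, compute):
    # Compute the result only the first time a given value is seen.
    key = array_key(arr)
    if key not in _cache:
        _cache[key] = compute(arr)
    return _cache[key]


calls = []


def expensive(arr):
    # Stand-in for the real work; records how often it actually runs.
    calls.append(1)
    return int(arr.sum())


idx = np.array([3, 1, 4, 1, 5])
first = cached_lookup(idx, expensive)
second = cached_lookup(idx.copy(), expensive)  # equal values, different object
```

The second call is served from the cache because the key depends only on the array's contents. Note that `tobytes()` copies the data, so the cost grows with array size; for small index arrays like the ones timed above this is negligible.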
Recommended answer
You can simply hash the underlying buffer, if you make it read-only:
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
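The consequence of that truncation can be checked directly. The sketch below assumes NumPy's default print options, under which large arrays are summarized with an ellipsis. It builds two arrays that differ only in an element from the elided middle, so their strings (and therefore their string hashes) match even though their contents do not.

```python
import numpy as np

a = np.arange(100000)
b = a.copy()
b[50000] = -1  # change one element in the part that str() elides

# The summarized string shows only the first and last few items,
# so the modification is invisible to it.
same_str = str(a) == str(b)

# The raw bytes cover every element, so they do differ.
same_bytes = a.tobytes() == b.tobytes()
```

Here `same_str` is true while `same_bytes` is false, which is why `hash(str(a))` is unsafe as a cache key for large arrays: distinct arrays would map to the same dict entry.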