Extract hash seed in unit testing
Question
I need to get the random hash seed used by Python in order to replicate failing unit tests.
If PYTHONHASHSEED is set to a non-zero integer, sys.flags.hash_randomization provides it reliably:
$ export PYTHONHASHSEED=12345
$ python3 -c 'import sys, os;print(sys.flags.hash_randomization, os.environ.get("PYTHONHASHSEED"))'
12345 12345
However, if hashing is randomised, it only states that a seed is used, not which one:
$ export PYTHONHASHSEED=random
$ python3 -c 'import sys, os;print(sys.flags.hash_randomization, os.environ.get("PYTHONHASHSEED"))'
1 random
The information in sys.hash_info never includes data that depends on the seed. With the hash function used since Python 3.4, it also seems infeasible to reconstruct the seed from given hashes.
Context: When fine-tuning an algorithm, we've seen heisenbugs that depend on set/dict iteration order. Replicating them requires trying seeds, at worst all 4294967295 of them, and even our average of ~100 attempts takes quite long.
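The seed dependence described above can be observed directly. The following is a minimal sketch, not part of the original question: the helper name set_order_with_seed and the sample set are made up for illustration. It starts a child interpreter with a given PYTHONHASHSEED and reports the order in which that interpreter iterates over a small set of strings.

```python
import os
import subprocess
import sys


def set_order_with_seed(seed):
    """Return how a child interpreter started with PYTHONHASHSEED=seed orders a set.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    snippet = "print(list({'alpha', 'beta', 'gamma', 'delta'}))"
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(set_order_with_seed(12345))
    print(set_order_with_seed(12345))  # same seed, so the same order
    print(set_order_with_seed(54321))  # a different seed may give a different order
```

Two runs with the same seed should print the same order, which is what makes a failing test replayable once the seed is known.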
We have considered always setting PYTHONHASHSEED externally to random but known values, but would like to avoid this extra layer.
Recommended answer
No, the random value is assigned to the uc field of the _Py_HashSecret union, but this is never exposed to Python code. That's because the number of possible values is far greater than what setting PYTHONHASHSEED can produce.
When you don't set PYTHONHASHSEED or set it to random, Python generates a random 24-byte value to use as the seed. If you set PYTHONHASHSEED to an integer, then that number is passed through a linear congruential generator to produce the actual seed (see the lcg_urandom() function). The problem is that PYTHONHASHSEED is limited to 4 bytes, so there are 256 ** 20 times more possible seed values than you could ever set via PYTHONHASHSEED alone.
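To make the expansion from a 4-byte integer to a 24-byte seed concrete, here is a rough Python port of lcg_urandom(). It is written from memory of the CPython source, so treat the multiplier, the increment and the byte extraction as assumptions to check against the interpreter version you actually run.

```python
def lcg_urandom(x0, size=24):
    """Expand a 32-bit integer into `size` pseudo-random bytes.

    Sketch of CPython's lcg_urandom(); the constants are recalled from the
    C source and should be treated as an assumption.
    """
    x = x0 & 0xFFFFFFFF
    out = bytearray()
    for _ in range(size):
        # Wrap around like a 32-bit unsigned int in C.
        x = (x * 214013 + 2531011) & 0xFFFFFFFF
        # Only one byte of the state is emitted per step.
        out.append((x >> 16) & 0xFF)
    return bytes(out)


seed_bytes = lcg_urandom(12345)
print(len(seed_bytes), seed_bytes.hex())

# Size of the 24-byte seed space relative to the 4-byte PYTHONHASHSEED space.
print(256 ** 24 // 256 ** 4 == 256 ** 20)
```

Every integer seed maps to exactly one 24-byte value, so at most 256 ** 4 of the 256 ** 24 possible secrets can ever be reached this way.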
You can access the internal hash value in the _Py_HashSecret struct using ctypes:
from ctypes import (
    c_size_t,
    c_ubyte,
    c_uint64,
    pythonapi,
    Structure,
    Union,
)


class FNV(Structure):
    _fields_ = [
        ('prefix', c_size_t),
        ('suffix', c_size_t)
    ]


class SIPHASH(Structure):
    _fields_ = [
        ('k0', c_uint64),
        ('k1', c_uint64),
    ]


class DJBX33A(Structure):
    _fields_ = [
        ('padding', c_ubyte * 16),
        ('suffix', c_size_t),
    ]


class EXPAT(Structure):
    _fields_ = [
        ('padding', c_ubyte * 16),
        ('hashsalt', c_size_t),
    ]


class _Py_HashSecret_t(Union):
    _fields_ = [
        # ensure 24 bytes
        ('uc', c_ubyte * 24),
        # two Py_hash_t for FNV
        ('fnv', FNV),
        # two uint64 for SipHash24
        ('siphash', SIPHASH),
        # a different (!) Py_hash_t for small string optimization
        ('djbx33a', DJBX33A),
        ('expat', EXPAT),
    ]


hashsecret = _Py_HashSecret_t.in_dll(pythonapi, '_Py_HashSecret')
hashseed = bytes(hashsecret.uc)
However, you can't actually do anything with this information. You can't set _Py_HashSecret.uc in a new Python process, because by the time your Python code runs, the interpreter has already created many dictionaries (Python internals rely heavily on them), and changing the secret would break most of their keys. And the chances of the random seed being equal to one of the 256 ** 4 possible LCG values are vanishingly small.
Your idea to set PYTHONHASHSEED to a known value everywhere is a far more feasible approach.
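One way to keep that extra layer small is a thin launcher that picks a random seed, records it, and starts the real test run in a child process. This is only a sketch under stated assumptions: the function name run_with_known_seed and the choice to log the seed to stderr are illustrative and do not come from the question or the answer.

```python
import os
import random
import subprocess
import sys


def run_with_known_seed(interpreter_args, seed=None):
    """Run `python <interpreter_args>` with a recorded PYTHONHASHSEED.

    Hypothetical helper: returns (seed, returncode) so that a failing run
    can be replayed later by passing the same seed back in.
    """
    if seed is None:
        # PYTHONHASHSEED accepts integers from 0 to 4294967295;
        # 0 disables hash randomisation, so it is skipped here.
        seed = random.randint(1, 4294967295)
    print(f"PYTHONHASHSEED={seed}", file=sys.stderr)
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    completed = subprocess.run([sys.executable, *interpreter_args], env=env)
    return seed, completed.returncode
```

A test run could then be started with run_with_known_seed(["-m", "unittest", "discover"]), and a failure replayed by calling the function again with seed= set to the value printed on stderr.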