人类可读订单代码的完美散列函数 [英] Perfect Hash Function for Human Readable Order Codes
问题描述
我正在尝试生成从(let say)一个从1开始的无符号32位内部ID,并为每个新订单自动递增的非顺序人类可读订单代码。
在我下面的示例代码中,每个 $ hash
是唯一的吗? (我打算对 $ hash
进行base34编码,使其可读取。)
<?php
函数int_hash($ key){
$ key =($ key ^ 0x47cb8a8c)^($ key<< 12);
$ key =($ key ^ 0x61a988bc)^($ key>> 19);
$ key =($ key ^ 0x78d2a3c8)^($ key<< 5);
$ key =($ key ^ 0x5972b1be)^($ key<< 9);
$ key =($ key ^ 0x2ea72dfe)^($ key<< 3);
$ key =($ key ^ 0x5ff1057d)^($ key>> 16);
return $ key;
}
($ order_id = 1; $ order_id <= PHP_INT_MAX; ++ $ order_id){
$ hash = int_hash($ order_id);
}
?>
如果没有,有没有关于如何替换 int_hash
?
结果是,base34编码一个 md5($ order_id)
太长了我的喜好
在下面的示例代码中,每个
$哈希
是唯一的?
几乎(我猜, 不,但在某种程度上很容易修复。)你的功能包括一系列独立的步骤;当且仅当这些步骤中的每一个步骤都是时,整体功能是双重的(可逆的)。 (你知道为什么吗?)
现在,每一步都有以下一种形式:
$ key =($ key ^ CONSTANT)^($ key>> NUM_BITS);
$ key =($ key ^ CONSTANT)^($ key<< NUM_BITS);
与 NUM_BITS!= 0
p>
我们可以将它们视为单一形式的变体,通过将前者视为几乎相当于此:
$ key = invert_order_of_bits($ key); #清晰的双体体
$ constant = invert_order_of_bits(CONSTANT);
$ key =($ key ^ $ constant)^($ key<< NUM_BITS);
$ key = invert_order_of_bits($ key); #清楚的双体
所以我们需要的是显示:
$ key =($ key ^ CONSTANT)^($ key<< NUM_BITS);
是双射的。现在,XOR是交换和关联的,所以上面相当于这个:
$ key = $ key ^($ key< ;< NUM_BITS);
$ key = $ key ^ CONSTANT;
和(x ^ y)^ y == x ^(y ^ y)== x ^ 0 == x
,所以清楚XOR-ing与常量值是可逆的(通过重新使用相同的值);所以我们要显示的是这是双射:
$ key = $ key ^($ key<< NUM_BITS );
每当 NUM_BITS!= 0
p>
现在,我没有写出严格的证明,所以我只是给出一个单个的例子,说明如何扭转这个。假设 $ key ^($ key< <9)
是
0010 1010 1101 1110 0010 0101 0000 1100
我们如何获得 $键
?那么,我们知道最后九位的 $ key<< 9
全部为零,所以我们知道 $ key ^($ key< 9)
的最后9位与$ code> $ key 的最后九位。所以 $ key
看起来像
bbbb bbbb bbbb bbbb bbbb bbb1 0000 1100
so $ key<< 9
看起来像
bbbb bbbb bbbb bb10 0001 1000 0000 0000
so $ key
看起来像
bbbb bbbb bbbb bb00 0011 1101 0000 1100
(通过XOR-ing $ key ^($ key< <9)
与 $ key<< 9
),所以 $ key<< 9
看起来像
bbbb b000 0111 1010 0001 1000 0000 0000
so $ key
看起来像
bbbb b010 1010 0100 0011 1101 0000 1100
所以 $ key<< 9
看起来像
0101 1000 0111 1010 0001 1000 0000 0000
so $ key
看起来像
0111 0010 1010 0100 0011 1101 0000 1100
所以。 。为什么我说几乎而不是是?为什么你的哈希函数不是完美的双目标?这是因为在PHP中,逐位移动运算符>>
和<
对称,而 $ key = $ key ^($ key<< NUM_BITS)
是完全可逆的, $ key = $键^($ key>> NUM_BITS)
不是。 (以上,当我写道两种类型的步骤是几乎相同时,我真的意味着几乎,它有所作为!)你看,而<
像任何其他位一样对待符号位,并将其转移出来(在右侧引入零位),>>
特别对待符号位,并扩展:它在左边带来的位等于符号位。 (NB你的问题提到未签名的32位值,但PHP实际上并不支持;它的按位操作总是在签名整数。)
由于此符号扩展名,如果 $ key
以 0
开头,则 $键>> NUM_BITS
以 0
开头,如果 $ key
以$ $ c $开头c> 1 ,然后 $ key>> NUM_BITS
也以 1
开头。在这两种情况下, $ key ^($ key>> NUM_BITS)
将以 0
开头。你已经失去了一点熵。如果你给我 $ key ^($ key>> 9)
,不要告诉我是否 $ key
是否定的,那么我可以做的最好的是计算两个可能的值:$ code> $ key :一个负数,一个正数或零。
您执行两个步骤,使用右移而不是左移,因此您丢失了两位熵。 (我手里挥了挥手,我实际证明的是,你至少输了一个 两位,但我有信心由于这些右移步骤之间的步骤的性质,您实际上会丢失两个完整位。)对于任何给定的输出值,只有四个不同的输入值可以产生。所以它不是独一无二的,但它几乎是独一无二的;并且可以通过以下方式轻松修复:
- 更改两个右移步骤以改用左移;或者
- 在任何左移步骤之前将两个右移步骤移动到函数的开始,并且说输出对于0和2之间的输入是唯一的− 1而不是0和2之间的输入 32 − 1。
I am trying to generate non-sequential human readable order codes derived from (lets say) a unsigned 32bit internal id that starts at 1 and is auto incremented for each new order.
In my example code below, will every $hash
be unique? (I plan to base34 encode the $hash
to make it human readable.)
<?php
function int_hash($key) {
$key = ($key^0x47cb8a8c) ^ ($key<<12);
$key = ($key^0x61a988bc) ^ ($key>>19);
$key = ($key^0x78d2a3c8) ^ ($key<<5);
$key = ($key^0x5972b1be) ^ ($key<<9);
$key = ($key^0x2ea72dfe) ^ ($key<<3);
$key = ($key^0x5ff1057d) ^ ($key>>16);
return $key;
}
for($order_id = 1; $order_id <= PHP_INT_MAX; ++$order_id) {
$hash = int_hash($order_id);
}
?>
If not, are there any suggestions on how to replace int_hash
?
The result of say, base34 encoding a md5($order_id)
is too long for my liking.
In my example code below, will every
$hash
be unique?
Almost. (Which, I guess, means "no, but in a way that's easily fixed".) Your function consists of a sequence of independent steps; the overall function is bijective (reversible) if and only if every single one of those steps is. (Do you see why?)
Now, each step has one of the following forms:
$key = ($key ^ CONSTANT) ^ ($key >> NUM_BITS);
$key = ($key ^ CONSTANT) ^ ($key << NUM_BITS);
with NUM_BITS != 0
.
We can actually treat these as variants of a single form, by viewing the former as almost equivalent to this:
$key = invert_order_of_bits($key); # clearly bijective
$constant = invert_order_of_bits(CONSTANT);
$key = ($key ^ $constant) ^ ($key << NUM_BITS);
$key = invert_order_of_bits($key); # clearly bijective
So all we need is to show that this:
$key = ($key ^ CONSTANT) ^ ($key << NUM_BITS);
is bijective. Now, XOR is commutative and associative, so the above is equivalent to this:
$key = $key ^ ($key << NUM_BITS);
$key = $key ^ CONSTANT;
and (x ^ y) ^ y == x ^ (y ^ y) == x ^ 0 == x
, so clearly XOR-ing with a constant value is reversible (by re-XOR-ing with the same value); so all we have to show is that this is bijective:
$key = $key ^ ($key << NUM_BITS);
whenever NUM_BITS != 0
.
Now, I'm not writing a rigorous proof, so I'll just give a single reasoned-out example of how to reverse this. Suppose that $key ^ ($key << 9)
is
0010 1010 1101 1110 0010 0101 0000 1100
How do we obtain $key
? Well, we know that the last nine bits of $key << 9
are all zeroes, so we know that the last nine bits of $key ^ ($key << 9)
are the same as the last nine bits of $key
. So $key
looks like
bbbb bbbb bbbb bbbb bbbb bbb1 0000 1100
so $key << 9
looks like
bbbb bbbb bbbb bb10 0001 1000 0000 0000
so $key
looks like
bbbb bbbb bbbb bb00 0011 1101 0000 1100
(by XOR-ing $key ^ ($key << 9)
with $key << 9
), so $key << 9
looks like
bbbb b000 0111 1010 0001 1000 0000 0000
so $key
looks like
bbbb b010 1010 0100 0011 1101 0000 1100
so $key << 9
looks like
0101 1000 0111 1010 0001 1000 0000 0000
so $key
looks like
0111 0010 1010 0100 0011 1101 0000 1100
So . . . why do I say "almost" rather than "yes"? Why is your hash-function not perfectly bijective? It's because in PHP, the bitwise shift operators >>
and <<
are not quite symmetric, and while $key = $key ^ ($key << NUM_BITS)
is completely reversible, $key = $key ^ ($key >> NUM_BITS)
is not. (Above, when I wrote that the two types of steps were "almost equivalent", I really meant that "almost". It makes a difference!) You see, whereas <<
treats the sign bit just like any other bit, and shifts it out of existence (bringing in a zero-bit on the right), >>
treats the sign bit specially, and "extends" it: the bit that it brings in on the left is equal to the sign bit. (N.B. Your question mentions "unsigned 32bit" values, but PHP doesn't actually support that; its bitwise operations are always on signed integers.)
Due to this sign extension, if $key
starts with a 0
, then $key >> NUM_BITS
starts with a 0
, and if $key
starts with a 1
, then $key >> NUM_BITS
also starts with a 1
. In either case, $key ^ ($key >> NUM_BITS)
will start with a 0
. You've lost exactly one bit of entropy. If you give me $key ^ ($key >> 9)
, and don't tell me whether $key
is negative, then the best I can do is compute two possible values for $key
: one negative, one positive-or-zero.
You perform two steps that use right-shift instead of left-shift, so you lose two bits of entropy. (I'm hand-waving slightly — all I've actually demonstrated is that you lose at least one bit and at most two bits — but I'm confident that, due to the nature of the steps between those right-shift steps, you do actually lose two full bits.) For any given output value, there are exactly four distinct input-values that could yield it. So it's not unique, but it's almost unique; and it's easily fixed, by either:
- changing the two right-shift steps to use left-shifts instead; or
- moving both of the right-shift steps to the start of the function, before any left-shift steps, and saying that outputs are unique for inputs between 0 and 231−1 rather than inputs between 0 and 232−1.
这篇关于人类可读订单代码的完美散列函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!