Why MULT 31 (hash function for string)?


Question

Hi there,
There's a classic hash function to hash strings, where MULT is defined
as "31":

//from programming pearls
unsigned int hash(char *ptr)
{   unsigned int h = 0;
    unsigned char *p = (unsigned char *)ptr;
    int n;
    for (n = k; n > 0; p++) {
        h = MULT * h + *p;
        if (*p == 0)
            n--;
    }
    return h % NHASH;
}

Why is MULT defined as 31? (How about 29? 24? Or 26?)
Thanks,
Wenjie

Recommended answer

go****@yahoo.com said:
> Hi there,
> There's a classic hash function to hash strings, where MULT is defined
> as "31":

<snip>

> Why is MULT defined as 31? (How about 29? 24? Or 26?)



Why not find out yourself? Generic hashing routines are intended to get you
a decent spread of hashes for arbitrary data. So cook up a few million
records of arbitrary data, and see what kind of spreads you get for various
multipliers.

If you call it research, maybe you can even fool someone into paying you.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
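
The experiment suggested above could be set up roughly as follows. This is only a sketch under stated assumptions: it uses a conventional one-string variant of the hash rather than the `k`-counting loop from the question, and `NHASH`, `next_rand`, `worst_bucket`, and `report_spreads` are hypothetical names and values chosen for illustration. The fullest bucket is a crude measure of spread; a real study would probably look at the whole distribution and at realistic keys.

```c
#include <stdio.h>
#include <string.h>

#define NHASH 1021u /* hypothetical table size, not taken from the thread */

/* Conventional one-string variant of the hash, with the multiplier as a parameter. */
static unsigned int hash_mult(const char *s, unsigned int mult)
{
    unsigned int h = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++)
        h = mult * h + *p;
    return h % NHASH;
}

/* Small deterministic generator so that every run sees the same "arbitrary" data. */
static unsigned int next_rand(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Hash `count` pseudo-random lowercase keys; return the size of the fullest bucket. */
static unsigned int worst_bucket(unsigned int mult, unsigned int count)
{
    static unsigned int buckets[NHASH];
    unsigned int state = 1u, worst = 0;
    char key[17];

    memset(buckets, 0, sizeof buckets);
    for (unsigned int i = 0; i < count; i++) {
        size_t len = 4 + next_rand(&state) % 13; /* key length 4..16 */
        for (size_t j = 0; j < len; j++)
            key[j] = (char)('a' + next_rand(&state) % 26);
        key[len] = '\0';
        unsigned int b = hash_mult(key, mult);
        if (++buckets[b] > worst)
            worst = buckets[b];
    }
    return worst;
}

/* Print the fullest bucket for each multiplier mentioned in the question. */
void report_spreads(unsigned int count)
{
    static const unsigned int mults[] = { 24u, 26u, 29u, 31u };
    for (size_t i = 0; i < sizeof mults / sizeof mults[0]; i++)
        printf("MULT=%2u  fullest bucket: %u\n", mults[i],
               worst_bucket(mults[i], count));
}
```

Calling `report_spreads(1000000u)` from `main` prints one line per multiplier. A fullest bucket close to `count / NHASH` hints at an even spread, and a much larger one hints at clustering.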


go****@yahoo.com wrote:
> Hi there,
> There's a classic hash function to hash strings, where MULT is defined
> as "31":
>
> //from programming pearls
> unsigned int hash(char *ptr)
> {   unsigned int h = 0;
>     unsigned char *p = (unsigned char *)ptr;
>     int n;
>     for (n = k; n > 0; p++) {
>         h = MULT * h + *p;
>         if (*p == 0)
>             n--;
>     }
>     return h % NHASH;
> }



It looks like this code was probably mis-transcribed.
There's this undefined quantity `k' floating around, and
the behavior at end-of-string can only be called broken.
But that doesn't seem central to your question:


> Why is MULT defined as 31? (How about 29? 24? Or 26?)



It's a mixture of superstition and good sense.

First, the superstition: People who write code having
to do with hash tables apparently recall that prime numbers
are particularly "good" for them. It seems they don't always
remember just what the "goodness" was or in what connection,
but they'll throw prime numbers into the mix whenever they
can. They'll throw in prime numbers even if they're not too
sure what a prime number is! A colleague of mine once ran
across this little coding gem:

#define HASHSIZE 51 /* a smallish prime */

Second, the good sense: Suppose MULT were 26, and consider
hashing a hundred-character string. How much influence does
the string's first character have on the final value of `h',
just before the mod operation? The first character's value
will have been multiplied by MULT 99 times, so if the arithmetic
were done in infinite precision the value would consist of some
jumble of bits followed by 99 low-order zero bits -- each time
you multiply by MULT you introduce another low-order zero, right?
The computer's finite arithmetic just chops away all the excess
high-order bits, so the first character's actual contribution to
`h' is ... precisely zero! The `h' value depends only on the
rightmost 32 string characters (assuming a 32-bit int), and even
then things are not wonderful: the first of those final 32 bytes
influences only the leftmost bit of `h' and has no effect on
the remaining 31. Clearly, an even-valued MULT is a poor idea.
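
The argument about even multipliers can be checked with a small sketch. It assumes 32-bit arithmetic (made explicit with `uint32_t`) and a plain one-string version of the hash without the final mod. The names `raw_hash` and `first_char_effect` are hypothetical helpers, not part of the original code.

```c
#include <stdint.h>
#include <string.h>

/* The hash from the question for one string, without the final "% NHASH". */
static uint32_t raw_hash(const char *s, uint32_t mult)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++)
        h = mult * h + *p;
    return h;
}

/*
 * Build two strings of `len` characters (1..255) that differ only in the
 * first character ('x' versus 'y') and return the XOR of their hashes,
 * i.e. the bits of h that this one-character change flips.
 */
static uint32_t first_char_effect(uint32_t mult, size_t len)
{
    char a[256], b[256];

    if (len < 1 || len > 255)
        return 0;
    memset(a, 'x', len);
    a[len] = '\0';
    memcpy(b, a, len + 1);
    b[0] = 'y';
    return raw_hash(a, mult) ^ raw_hash(b, mult);
}
```

With these definitions, `first_char_effect(26, 100)` should come out as 0 (the first character is lost entirely), `first_char_effect(26, 32)` should have only the top bit set, and `first_char_effect(31, 100)` should be nonzero, which matches the reasoning above.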

Need MULT be prime? Not as far as I know (I don't know
everything); any odd value ought to suffice. 31 may be attractive
because it is close to a power of two, and it may be easier for
the compiler to replace a possibly slow multiply instruction with
a shift and subtract (31*x == (x << 5) - x) on machines where it
makes a difference. Setting MULT one greater than a power of two
(e.g., 33) would also be easy to optimize, but might produce too
"simple" an arrangement: mostly a juxtaposition of two copies
of the original set of bits, with a little mixing in the middle.
So you want an odd MULT that has plenty of one-bits.
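
The shift-based rewrites mentioned above look like this when spelled out by hand. This is just an illustration with hypothetical function names; whether a given compiler actually performs the substitution, or whether it pays off on a given machine, is not something the sketch shows.

```c
#include <stdint.h>

/* 31 * x written as a shift and a subtract: 32x - x. */
static uint32_t times31(uint32_t x)
{
    return (x << 5) - x;
}

/* 33 * x written as a shift and an add: 32x + x, two copies of x five bits apart. */
static uint32_t times33(uint32_t x)
{
    return (x << 5) + x;
}
```

Because unsigned arithmetic wraps modulo 2^32, both identities hold for every `uint32_t` value, including those where the shift discards high-order bits.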

You'd also like a MULT that "smears" its operand bits across
as much of `h' as you can manage, so MULT shouldn't be too small
(consider the degenerate case of MULT==1). Also, MULT shouldn't
be too large -- to put it differently, UINT_MAX-MULT shouldn't be
too small. How small is "too small" depends to some extent on
the expected lengths of the strings; I suspect 31 is "too small"
if the strings are short (the values won't have time to invade
the high-order part of `h'). I think it would be wiser to choose
MULT somewhere between sqrt(UINT_MAX) and UINT_MAX-sqrt(UINT_MAX),
making sure it's odd and has lots of one-bits. Primality doesn't
seem important here -- but perhaps someone else may offer a good
reason to choose a prime. Sometimes superstition has valid roots.
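
To see the short-string concern concretely, one can count how many bits of `h' a short key actually reaches. The sketch below assumes 32-bit arithmetic; `raw_hash` and `bits_used` are hypothetical helpers, and `BIG_MULT` is merely one example of an odd constant with many one-bits inside the suggested range. It was picked for illustration and does not come from the thread.

```c
#include <stdint.h>

/* Example only: odd, 19 one-bits, between sqrt(UINT_MAX) and UINT_MAX - sqrt(UINT_MAX). */
#define BIG_MULT 0x9E3779B1u

/* The hash from the question for one string, without the final "% NHASH". */
static uint32_t raw_hash(const char *s, uint32_t mult)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++)
        h = mult * h + *p;
    return h;
}

/* Number of bits needed to represent h (0 when h is 0). */
static unsigned int bits_used(uint32_t h)
{
    unsigned int n = 0;
    while (h != 0) {
        n++;
        h >>= 1;
    }
    return n;
}
```

With MULT 31 a three-character key can never exceed 255 * (31*31 + 31 + 1) = 253215, so at most the low 18 bits of `h' are touched; for "abc" the raw value is 96354, which needs 17 bits. Comparing `bits_used(raw_hash(key, 31u))` with `bits_used(raw_hash(key, BIG_MULT))` over a set of short keys is one way to explore the difference.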

--
Eric Sosman
es*****@acm-dot-org.invalid


Thanks, Eric, for a good rundown of hash multipliers. A lot of us don't
have a good feel for the tradeoffs in this area.
FWIW: Many of the Borland compilers (wickedly fast, if you don't already
know) used a multiplier of "13" for hashing.

