Simple hash functions

Question

I'm trying to write a C program that uses a hash table to store different words and I could use some help.

Firstly, I create a hash table with a size equal to the prime number closest to the number of words I have to store, and then I use a hash function to find an address for each word. I started with the simplest function, adding the letters together, which ended up with an 88% collision rate. Then I started experimenting with the function and found out that whatever I change it to, the collisions don't get lower than 35%. Right now I'm using

unsigned int stringToHash(char *word, unsigned int hashTableSize){
  unsigned int counter, hashAddress = 0;
  for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = hashAddress * word[counter] + word[counter] + counter;
  }
  return (hashAddress % hashTableSize);
}

which is just a random function that I came up with, but it gives me the best results: around a 35% collision rate.
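
For reference, a collision rate like this can be measured with a minimal sketch along the following lines (the helper name is made up; stringToHash is the function above): a word counts as a collision when its slot is already occupied by an earlier word.

#include <stdlib.h>

unsigned int stringToHash(char *word, unsigned int hashTableSize);

double collisionRate(char **words, size_t wordCount, unsigned int tableSize) {
  unsigned char *occupied = calloc(tableSize, 1);
  size_t collisions = 0;
  for (size_t i = 0; i < wordCount; i++) {
    unsigned int slot = stringToHash(words[i], tableSize);
    if (occupied[slot]) collisions++;   /* slot already taken: collision */
    else occupied[slot] = 1;
  }
  free(occupied);
  return (double)collisions / (double)wordCount;
}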

I've been reading articles on hash functions for the past few hours and I tried to use a few simple ones, such as djb2, but all of them gave me even worse results. (djb2 resulted in 37% collisions, which isn't much worse, but I was expecting something better rather than worse.) I also don't know how to use some of the other, more complex ones, such as murmur2, because I don't know what the parameters (key, len, seed) they take are.

Is it normal to get more than 35% collisions, even when using djb2, or am I doing something wrong? What are the key, len and seed values?

Thanks in advance.

Answer

Try sdbm:

hashAddress = 0;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}

Or djb2:

hashAddress = 5381;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}

Or Adler32:

#include <stdint.h>
#include <stddef.h>

/* Adler-32 checksum: two running sums modulo 65521 (the largest
   prime below 2^16), combined into one 32-bit value. */
uint32_t adler32(const void *buf, size_t buflength) {
    const uint8_t *buffer = (const uint8_t *)buf;

    uint32_t s1 = 1;
    uint32_t s2 = 0;

    for (size_t n = 0; n < buflength; n++) {
        s1 = (s1 + buffer[n]) % 65521;
        s2 = (s2 + s1) % 65521;
    }
    return (s2 << 16) | s1;
}

// ...

hashAddress = adler32(word, strlen(word));

None of these is really great, though. If you really want good hashes, you need something more complex, e.g. lookup3.

Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (for which they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one, and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended, better 1400 or even 1500.
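
As a rough illustration of that grow-and-rehash step, here is a minimal sketch (the struct layout and names are made up, and collision handling is left out for brevity):

#include <stdlib.h>

unsigned int stringToHash(char *word, unsigned int hashTableSize);

typedef struct {
  char **slots;            /* NULL = empty; collision handling omitted */
  unsigned int capacity;
  unsigned int size;
} HashTable;

/* Called once size / capacity exceeds ~0.75: allocate a larger array
   and rehash every word, since each index depends on the capacity. */
static void growAndRehash(HashTable *table) {
  unsigned int newCapacity = table->capacity + table->capacity / 2;
  char **newSlots = calloc(newCapacity, sizeof(char *));
  for (unsigned int i = 0; i < table->capacity; i++) {
    if (table->slots[i] != NULL) {
      unsigned int slot = stringToHash(table->slots[i], newCapacity);
      newSlots[slot] = table->slots[i];  /* real code resolves collisions here */
    }
  }
  free(table->slots);
  table->slots = newSlots;
  table->capacity = newCapacity;
}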

Hashtables are not supposed to be "filled to the brim", at least not if they are to be fast and efficient (so they should always have spare capacity). That's the downside of hashtables: they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (if you store the words as a sorted array, you only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision-free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).

Also note that if you are using a pretty good hash function, there is no requirement, nor even an advantage, for the hashtable to have a power-of-2 capacity if you are cropping hash values using modulo (%) at the end. The reason why many hashtable implementations always use power-of-2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND; in the best case it is equally fast, in most cases it is a lot slower). If your hashtable uses power-of-2 sizes, you can replace any modulo with an AND operation:

x % 4  == x & 3
x % 8  == x & 7
x % 16 == x & 15
x % 32 == x & 31
...

This only works for power-of-2 sizes, though. If you use modulo, power-of-2 sizes only buy you something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect.
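
If you do want to go the power-of-2 route, a common bit-twiddling idiom rounds a capacity up to the next power of two (a sketch; the function name is made up):

#include <stdint.h>

/* Round a capacity up to the next power of two (32-bit, v > 0), so
   that (hash & (capacity - 1)) can replace (hash % capacity). */
static uint32_t nextPowerOfTwo(uint32_t v) {
  v--;
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  return v + 1;
}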

I created a stripped-down lookup3 implementation for you:

#include <stdint.h>
#include <stdlib.h>

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}

#define final(a,b,c) \
{ \
  c ^= b; c -= rot(b,14); \
  a ^= c; a -= rot(c,11); \
  b ^= a; b -= rot(a,25); \
  c ^= b; c -= rot(b,16); \
  a ^= c; a -= rot(c,4);  \
  b ^= a; b -= rot(a,14); \
  c ^= b; c -= rot(b,24); \
}

uint32_t lookup3 (
  const void *key,
  size_t      length,
  uint32_t    initval
) {
  uint32_t  a,b,c;
  const uint8_t  *k;
  const uint32_t *data32Bit;

  data32Bit = key;
  a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;

  /* process the input in 12-byte (three 32-bit word) chunks */
  while (length > 12) {
    a += *(data32Bit++);
    b += *(data32Bit++);
    c += *(data32Bit++);
    mix(a,b,c);
    length -= 12;
  }

  /* handle the remaining bytes; every switch case falls through */
  k = (const uint8_t *)data32Bit;
  switch (length) {
    case 12: c += ((uint32_t)k[11])<<24;
    case 11: c += ((uint32_t)k[10])<<16;
    case 10: c += ((uint32_t)k[9])<<8;
    case 9 : c += k[8];
    case 8 : b += ((uint32_t)k[7])<<24;
    case 7 : b += ((uint32_t)k[6])<<16;
    case 6 : b += ((uint32_t)k[5])<<8;
    case 5 : b += k[4];
    case 4 : a += ((uint32_t)k[3])<<24;
    case 3 : a += ((uint32_t)k[2])<<16;
    case 2 : a += ((uint32_t)k[1])<<8;
    case 1 : a += k[0];
             break;
    case 0 : return c;
  }
  final(a,b,c);
  return c;
}

This code is not as highly optimized for performance as the original code, and therefore it is a lot simpler. It is also not as portable as the original, but it is portable to all major consumer platforms in use today. It also completely ignores CPU endianness, yet that is not really an issue: it will work on big and little endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big and little endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs, and it is only important that it always calculates the same hash for the same input data on a single machine.

You would use this function as follows:

unsigned int stringToHash(char *word, unsigned int hashTableSize){
  unsigned int initval;
  unsigned int hashAddress;

  initval = 12345;
  hashAddress = lookup3(word, strlen(word), initval);
  return (hashAddress%hashTableSize);
  // If the hashtable is guaranteed to always have a power-of-2 size,
  // the line above can be replaced with the following more efficient one:
  //     return (hashAddress & (hashTableSize - 1));
}

You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of it (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice, yet each time should produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence that you must treat as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes the initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like, or pick a random value when the hashtable is created (and always use this random value for this instance of the hashtable, yet each hashtable has its own random value).
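
As an example of that combining use, a minimal sketch (the wrapper name is made up; lookup3 is the function above):

#include <stdint.h>
#include <string.h>

uint32_t lookup3(const void *key, size_t length, uint32_t initval);

/* Hash two strings as one logical key: the hash of the first string
   becomes the initval for the second, so both influence the result. */
uint32_t hashPair(const char *first, const char *second) {
  uint32_t h = lookup3(first, strlen(first), 0);
  return lookup3(second, strlen(second), h);
}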

A note on collisions:

Collisions are usually not such a huge problem in practice; it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.

You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array would need 4500 comparisons on average. On my system, 4500 string comparisons (assuming the words are between 3 and 20 characters long) take 38 microseconds (0.000038 seconds). So even such a simple, inefficient algorithm is fast enough for most purposes. Assuming you sort the word list and use a binary search, finding a word in the array will need only 13 comparisons on average. 13 comparisons are close to nothing in terms of time; it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question of whether that may be a huge performance problem.
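
For illustration, that binary search is only a few lines with the standard library (a sketch; it assumes the word array was sorted with the same comparator beforehand, e.g. via qsort):

#include <stdlib.h>
#include <string.h>

/* Comparator for an array of char*: bsearch hands us a pointer to the
   key pointer and a pointer to an array element. */
static int compareWords(const void *a, const void *b) {
  return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* O(log n): about 13 comparisons for 9000 sorted words. */
char **findWord(const char *word, char **sortedWords, size_t wordCount) {
  return bsearch(&word, sortedWords, wordCount, sizeof(char *), compareWords);
}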

In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons; however, in the case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take to hash the same amount of data. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than for a sorted list). The fact that a hashtable is O(1) only means that, regardless of how big it is, a lookup will always need approximately the same amount of time. O(log n) only means that the lookup grows logarithmically with the number of words: more words, slower lookup. Yet Big-O notation says nothing about absolute speed! This is a big misunderstanding. It does not follow that an O(1) algorithm always performs faster than an O(log n) one. Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster just by looking at the Big-O notation.

Back to collisions. What should you do if you run into a collision? If the number of collisions is small (and here I don't mean the overall number of collisions, i.e. the number of words colliding in the hashtable, but the per-index one, i.e. the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there has been no collision so far for a table index, there is just a single key/value pair. If there has been a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list, verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries, and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1); finding the value (or detecting that the key is not in the table) is O(n), but here n is only the number of linked-list entries (so it is 4 at most).
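
A minimal sketch of such a chained lookup (struct and function names are made up):

#include <string.h>

unsigned int stringToHash(char *word, unsigned int hashTableSize);

/* One bucket entry; words that collide at an index form a short
   singly linked list ("separate chaining"). */
typedef struct Entry {
  char *key;
  void *value;
  struct Entry *next;
} Entry;

/* Find the index in O(1), then walk the chain: n is only the number
   of entries stored at that index (at most ~4 in this scenario). */
void *lookupChained(Entry **buckets, unsigned int tableSize, char *key) {
  Entry *e = buckets[stringToHash(key, tableSize)];
  for (; e != NULL; e = e->next) {
    if (strcmp(e->key, key) == 0) {
      return e->value;
    }
  }
  return NULL; /* key not in the table */
}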

If the number of collisions rises, a linked list can become too slow, and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups in O(log n); again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair would take at most 7 comparisons (since ceil(log2(100)) = 7). That's still close to nothing. That said, if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in the case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.

Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little extra memory to store the alternative data structure, and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chance of even more collisions as more data is added. The second drawback is that while lookup times for keys increase only linearly with the number of collisions so far (and as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable degrade even worse, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know that without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.
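
For completeness, here is a sketch of that in-table approach (linear probing, one common form of open addressing; names are made up), which shows why lookups for absent keys degrade:

#include <string.h>

unsigned int stringToHash(char *word, unsigned int hashTableSize);

/* Open addressing with linear probing: on a collision, try the next
   slot. A lookup for an absent key keeps probing until it hits an
   empty slot, so a nearly full table degenerates toward a linear scan. */
char *lookupProbing(char **slots, unsigned int tableSize, char *key) {
  unsigned int start = stringToHash(key, tableSize);
  for (unsigned int i = 0; i < tableSize; i++) {
    unsigned int slot = (start + i) % tableSize;
    if (slots[slot] == NULL)
      return NULL;                       /* empty slot: key is absent */
    if (strcmp(slots[slot], key) == 0)
      return slots[slot];                /* found */
  }
  return NULL; /* table is full and the key is absent */
}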
