哈希表实现 [英] Hash table implementation

查看:110
本文介绍了哈希表实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚刚买了一本书C接口和实现。
在第一章中,先后实施了凌动结构,样品code如下:

 的#define NELEMS(X)((的sizeof(X))/(sizeof的((X)[0])))
静态结构原子{
    结构原子*链接;
    INT LEN;
    字符*海峡;
} *水桶[2048];
静态无符号长散射[] = {
2078917053,143302914,1027100827,1953210302,755253631,2002600785,
1405390230,45248011,1099951567,433832350,2018585307,438263339,
813528929,1703199216,618906479,573714703,766270699,275680090,
1510320440,1583583926,1723401032,1965443329,1098183682,1636505764,
980071615,1011597961,643279273,1315461275,157584038,1069844923,
471560540,89017443,1213147837,1498661368,2042227746,1968401469,
1353778505,1300134328,2013649480,30624642​​4,1733966678,1884751139,
744509763,400011959,1440466707,1363416242,973726663,59253759,
1639096332,336563455,1642837685,1215013716,154523136,593537720,
704035832,1134594751,1605135681,1347315106,302572379,1762719719,
269​​676381,774132919,1851737163,1482824219,125310639,1746481261,
1303742040,1479089144,899131941,1169907872,1785335569,485614972,
907175364,382361684,885626931,200158423,1745777927,1859353594,
259412182,1237390611,48433401,1902249868,304920680,202956538,
348303940,1008956512,1337551289,1953439621,208787970,1640123668,
1568675693,478464352,266772940,1272929208,1961288571,392083579,
871926821,1117546963,1871172724,1771058762,139971187,1509024645,
109190086,1047146551,1891386329,994817018,1247304975,1489680608,
706686964,1506717157,579587572,755120366,1261483377,884508252,
958076904,1609787317,1893464764,148144545,1415743291,2102252735,
1788268214,836935336,433233439,2055041154,2109864544,247038362,
299641085,834307717,1364585325,23330161,457882831,1504556512,
1532354806,567072918,404219416,1276257488,1561889936,1651524391,
618454448,121093252,1010757900,1198042020,876213618,124757630,
2082550272,1834290522,1734544947,1828531389,1982435068,1002804590,
1783300476,1623219634,1839739926,69050267,1530777140,1802120822,
316088629,1830418225,488944891,1680673954,1853748387,946827723,
1037746818,1238619545,1513900641,1441966234,367393385,928306929,
946006977,985847834,1049400181,1956764878,36406206,1925613800,
2081522508,2118956479,1612420674,1668583807,1800004220,1447372094,
523904750,1435821048,923108080,216161028,1504871315,306401572,
2018281851,1820959944,2136819798,359743094,1354150250,1843084537,
1306570817,244413420,934220434,672987810,1686379655,1301613820,
1601294739,484902984,139978006,503211273,294184214,176384212,
281341425,228223074,147857043,1893762099,1896806882,1947861263,
1193650546,273227984,1236198663,2116758626,489389012,593586330,
275676551,360187215,267062626,265012701,719930310,1621212876,
2108097238,2026501127,1865626297,894834024,552005290,1404522304,
48964196,5816381,1889425288,188942202,509027654,36125855,
365326415,790369079,264348929,513183458,536647531,13672163,
313561074,1730298077,286900147,1549759737,1699573055,776289160,
2143346068,1975249606,1136476375,262925046,92778659,1856406685,
1884137923,53392249,1735424165,1602280572
};
为const char * Atom_new(为const char * str中,INT LEN){
    无符号长H;
    INT I;
    原子结构* P;
    断言(STR);
    断言(LEN&GT = 0);
    对于(H = 0,I = 0; I< LEN,我++)
        H =(H<< 1)+散射[(无符号字符)海峡[I]];
    的H& = NELEMS(桶)-1;
    为(P =桶[H]; P; P = P-GT&;链接)
        如果(LEN == p型> LEN){
            对于(i = 0; I<&LEN功放;&安培; P-> STR [I] ==海峡[I])
                我++;
            如果(我== LEN)
                返回P-> STR;
        }
    p值= ALLOC(sizeof的(* P)+ LEN + 1);
    P-> LEN = LEN;
    对 - >海峡=(字符*)(P + 1);
    如果(LEN大于0)
        的memcpy(对GT; STR,STR,LEN);
    P->海峡[长度] ='\\ 0';
    P->链接=桶[H]
    水桶[H] = P;在列表的前面插入//原子
    返回P-> STR;
}

在本章结尾,在练习3.1,这本书的作者说:
大多数文本建议使用一个素数的大小
桶。使用质和良好的散列函数通常给一个
更好的分配名单挂桶的长度。
原子使用二的幂,有时明确地引
作为一个不错的选择。编写一个程序来生成或读取,比如一
典型的串并测量Atom_new的速度和分配
的列表的长度。然后改变值区,以便它具有
2039项(最大素小于2048),并重复该
测量。是否使用黄金的帮助?多少你
结论取决于您的特定机器上?

所以我做了哈希表的大小变更为2039,但它似乎实际上是由一个素数
列表的长度不好的分布,我已经尝试了64,61,61竟做了一个错误的分布了。

我只是想知道,为什么一个素表的大小做一个坏的分布,这是因为使用Atom_new散列函数不好的散列函数?

我使用这个功能来打印出原子列表的长度

 的#define B_SIZE 2048
无效Atom_print(无效)
{
    INT I,T;
    结构原子*原子。
    对于(i = 0; I< B_SIZE;我++){
        T = 0;
        对于(原子=桶[I];原子;原子= atom->链接){
            ++吨;
        }
        的printf(%D,T);
    }
}


解决方案

好了,沿着时间以前,我必须实现一个哈希表(在驱动开发),和我差不多。见鬼,为什么我应该使用一个素数? 2 OTOH功率为更好 - 而不是2的幂,你可以使用位与情况下计算模

所以,我实现了这样一个哈希表。关键是一个指针(某些第三方函数返回)。于是,最终我发现,在我哈希表只有在所有条目的1/4被填满。因为我使用的哈希函数是恒等函数,和以防万一的事实证明,所有返回的指针是4的倍数。

使用质数作为哈希表大小的想法是这样的:真实世界的散列函数的不要产生同样分布值。一般有(或者至少可能有)某些依赖性。因此,为了的的这种分配它建议使用质数。

顺便说一句,在理论上有可能发​​生,偶尔散列函数将生产出您所选择的质数的倍数的数字。但这个概率比如果它不是一个素数低。

I just bought a book "C Interfaces and Implementations". in chapter one , it has implemented a "Atom" structure, sample code as follow:

#define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0])))
static struct atom {
    struct atom *link;
    int len;
    char *str;
} *buckets[2048];
static unsigned long scatter[] = {
2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785,
1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339,
813528929, 1703199216, 618906479, 573714703, 766270699, 275680090,
1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764,
980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923,
471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469,
1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139,
744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759,
1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720,
704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719,
269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261,
1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972,
907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594,
259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538,
348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668,
1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579,
871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645,
109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608,
706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252,
958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735,
1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362,
299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512,
1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391,
618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630,
2082550272, 1834290522, 1734544947, 1828531389, 1982435068, 1002804590,
1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822,
316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723,
1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929,
946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800,
2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094,
523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572,
2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537,
1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820,
1601294739, 484902984, 139978006, 503211273, 294184214, 176384212,
281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263,
1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330,
275676551, 360187215, 267062626, 265012701, 719930310, 1621212876,
2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304,
48964196, 5816381, 1889425288, 188942202, 509027654, 36125855,
365326415, 790369079, 264348929, 513183458, 536647531, 13672163,
313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160,
2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685,
1884137923, 53392249, 1735424165, 1602280572
};
const char *Atom_new(const char *str, int len) {
    unsigned long h;
    int i;
    struct atom *p;
    assert(str);
    assert(len >= 0);
    for (h = 0, i = 0; i < len; i++)
        h = (h<<1) + scatter[(unsigned char)str[i]];
    h &= NELEMS(buckets)-1;
    for (p = buckets[h]; p; p = p->link)
        if (len == p->len) {
            for (i = 0; i < len && p->str[i] == str[i]; )
                i++;
            if (i == len)
                return p->str;
        }
    p = ALLOC(sizeof (*p) + len + 1);
    p->len = len;
    p->str = (char *)(p + 1);
    if (len > 0)
        memcpy(p->str, str, len);
    p->str[len] = '\0';
    p->link = buckets[h];
    buckets[h] = p;//insert atom in front of list
    return p->str;
}

at end of chapter , in exercises 3.1, the book's author said "Most texts recommend using a prime number for the size of buckets. Using a prime and a good hash function usually gives a better distribution of the lengths of the lists hanging off of buckets. Atom uses a power of two, which is sometimes explicitly cited as a bad choice. Write a program to generate or read, say, 10,000 typical strings and measure Atom_new’s speed and the distribution of the lengths of the lists. Then change buckets so that it has 2,039 entries (the largest prime less than 2,048), and repeat the measurements. Does using a prime help? How much does your conclusion depend on your specific machine?"

so I did changed that hash table size to 2039,but it seems a prime number actually made a bad distribution of the lengths of the lists, I have tried 64, 61, 61 actually made a bad distribution too.

I am just want to know why a prime table size make a bad distribution, is this because the hash function used with Atom_new a bad hash function?

I am using this function to print out the lengths of the atom lists

#define B_SIZE 2048
void Atom_print(void)
{
    int i,t;
    struct atom *atom;
    for(i= 0;i<B_SIZE;i++) {
        t = 0;
        for(atom=buckets[i];atom;atom=atom->link) {
            ++t;
        }
        printf("%d ",t);
    }
}

解决方案

Well, along time ago I had to implement a hash table (in driver development), and I about the same. Why the heck should I use a prime number? OTOH power of 2 is even better - instead of calculating the modulus in case of power of 2 you may use bitwise AND.

So I've implemented such a hash table. The key was a pointer (returned by some 3rd-party function). Then, eventually I noticed that in my hash table only 1/4 of all the entries is filled. Because that hash function I used was identity function, and just in case it turned out that all the returned pointers are multiples of 4.

The idea of using the prime numbers for the hash table size is the following: real-world hash functions do not produce equally-distributed values. Usually there's (or at least there may be) some dependency. So, in order to diffuse this distribution it's recommended to use prime numbers.

BTW, theoretically there may happen that occasionally the hash function will produce the numbers that are multiples of your chosen prime number. But the probability of this is lower than if it was not a prime number.

这篇关于哈希表实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆