衡量一个散列函数质量(配图使用/ assosiative阵列) [英] Measuring a hash functions quality (for use with maps/assosiative arrays)

查看:216
本文介绍了衡量一个散列函数质量(配图使用/ assosiative阵列)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我寻找到C中的关联数组库(我没有写)。类似C语言的地图++或Python的字典的。

I'm looking into an associative array library in C (which I didn't write). Similar to a map in C++ or Python's dict's.

有一些不规范的散列函数,我不能肯定,如果他们都十分的优秀。 (也许原来的开发商只是把一些魔法数字,异或运算符和最好的希望)的。

There are some non-standard hashing functions which I'm not certain if they are even very good. (maybe the original developer just threw some magic-numbers, xor-operators and hoped for the best).

我写了一个测试,测量散列函数的执行方式以及给予一定的样本输入,来衡量它是如何均匀地(在这种情况下,模数组大小)的项目分配到水桶固定数目的。

I wrote a test which measures how well the hashing function performs given some sample input, to measure how evenly it distributes items into a fixed number of buckets (modulo array size in this case).

这种方式,给予足够的投入,会有一些方法来衡量如何散列函数执行。

This way, given enough input, there would be some way to measure how well a hash function performs.

看起来这一定是人写的关联数组一个常见的​​问题。

It seems like it must be a common issue for anyone writing an associative array.

有一些约定,用以测量以及散列函数执行? (配送质量,而不是速度方面)的。

Is there some convention for measuring how well a hash function performs? (in terms of distribution quality, not speed).

其中最差将是相同的结果对每个输入,并且最好将给予的均匀分布(或尽可能接近)。

Where worst would be the same result for every input, and best would give an even distribution (or as close as possible).

注意,我不是在寻找在这里加密强度。

推荐答案

有一个(以页面从龙书中间)。

There is a Formula (in the middle of the page) from the dragon book.

我个人有一个经验法则:(假设线性链)将N个项目为N slots->链,并计算访问总数(第一链:= 1访问; 2:= 2的访问,等等)需要获得所有 N个元素。 (这等于SUM(chainlen *(chainlen +1)/ 2),求和所有连锁)

I personally have a rule of thumb: (assuming linear chaining) Insert N items into N slots->chains, and compute the total number of accesses (first in chain := 1 access; 2nd := 2 accesses, etc) needed to obtain all N elements. (this is equal SUM ( chainlen * (chainlen +1) /2) , summed over all chains)

由于随机输入数据,对于任何合理的散列函数这个指标应该是1.5 * N,或略低于这一点。

Given random input data, for any reasonable hashfunction this metric should be 1.5 * N, or a bit below that.

使用2543846独特的标记列表的典型运行示例/字(和他们的统计数据) 散列到完全2543846插槽/桶:

Example of a typical run using a list of 2543846 unique tokens/words (and their statistics) hashed into exactly 2543846 slots/buckets:

plasser@pisbak:~/src/hash$ ./diskhash woorden.txt woorden.hsh
Ptr = 0x7fb5c264f000, Sz = 37362821
Array= 0x7fb5bff7e000 Cnt = 2543846
__________________
Histogram of seek lenghts:
len:    Count     Hops   Fraction (Cumulative)
  1:  1606429  1606429 0.63149617 (0.63149617)
  2:   672469  1344938 0.26435130 (0.89584747)
  3:   205046   615138 0.08060472 (0.97645219)
  4:    48604   194416 0.01910650 (0.99555869)
  5:     9477    47385 0.00372546 (0.99928415)
  6:     1581     9486 0.00062150 (0.99990565)
  7:      215     1505 0.00008452 (0.99999017)
  8:       24      192 0.00000943 (0.99999961)
  9:        1        9 0.00000039 (1.00000000)
Tot:  2543846  3819498           (1.50147)
Cnt  2543846 Empty   937417 (0.36850) Collisions 247 RedDragon 7638996/7631537=1.000977
__________________

  • 空槽的分数为0.36850,这是关于什么应该是(1 / e)的
  • 与一个以上的项目(链长> 1)的时隙的分数也是关于(1 / e)的
  • 插槽恰好有1个项目的分数还剩下什么:: 1 - (2 / E)
  • 的碰撞次数的似乎的有点高,但用2.5M项目上的32位散列值是不例外。
    • the fraction of empty slots is 0.36850 , which is about what it should be (1/e)
    • the fraction of slots with more than one item (chain-length > 1) is also about (1/e)
    • the fraction of slots with exactly 1 item is what is left :: 1 - (2/e)
    • the number of collisions seems a bit high, but with 2.5 M items on a 32bits hashvalue it is not exceptional.
    • 这篇关于衡量一个散列函数质量(配图使用/ assosiative阵列)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆