Designing an efficient hash

Problem description

I have a set of nodes (N) arranged into one or more networks. Each network has one or more root nodes. Each node has a unique ID (UID) that is a 32-bit integer. A set of 100,000 nodes would be a large problem; 1,000,000 would be almost infeasibly gigantic. I need to check whether every node is connected to at least one root node, and I'm trying to design a hash function to help me.

My current code starts at each root node and traces all paths, recording each connected node. It does this without hashing, since previously I could rely on the UIDs being relatively small and mostly contiguous. After subtracting a starting offset, it uses the UID directly to index into a bitmap array. The bitmap is then compared against the members of N to determine which nodes are not connected.
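The approach described above might look roughly like the following. This is only a minimal sketch, not the actual code from the question: the function name `find_unconnected_bitmap`, the `edges` adjacency mapping, and the breadth-first traversal order are all assumptions made for illustration. It also assumes every root and every neighbour is a member of N.

```python
from collections import deque


def find_unconnected_bitmap(uids, roots, edges):
    """Return the UIDs in `uids` that are not reachable from any root.

    `edges` is a hypothetical dict mapping a UID to an iterable of
    neighbouring UIDs; the question does not say how links are stored.
    """
    offset = min(uids)
    span = max(uids) - offset + 1        # bitmap covers the whole UID range
    seen = bytearray((span + 7) // 8)    # one bit per *possible* UID

    def mark(uid):
        i = uid - offset
        seen[i >> 3] |= 1 << (i & 7)

    def is_marked(uid):
        i = uid - offset
        return bool(seen[i >> 3] & (1 << (i & 7)))

    queue = deque()
    for root in roots:
        if not is_marked(root):
            mark(root)
            queue.append(root)

    # Trace every path outward from the roots, marking each node once.
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            if not is_marked(neighbour):
                mark(neighbour)
                queue.append(neighbour)

    # Compare the bitmap against the members of N.
    return [uid for uid in uids if not is_marked(uid)]
```

Note that the bitmap is sized by the UID range (`max - min`), not by the number of nodes.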

Now, however, I'm coming across situations where there are large gaps in the UIDs (e.g., jumping from 100,000 to 100,000,000). This makes simply increasing the size of the bitmap more and more impractical, hence the need for a hash. In the general case, provided the user hasn't completely messed up, I would expect the number of unconnected nodes to be less than the number of connected ones (but not necessarily).

How do I, for instance, pick how many buckets to use for hashing so as to balance memory use against the likelihood of collisions? Can this (should this?) be done dynamically, i.e., should I allocate my bucket array based on the size of N? Is it more efficient to start by hashing all my node UIDs into buckets and then clear the 'connected' hashes as they are discovered? How can I determine the best way to handle collisions? Is it possible to create a perfect hash dynamically?

What I have tried:

Nothing much more than Googling. I'm not a computer scientist, so this is a bit new to me. I'm finding the masses of online information a bit hard to digest - hence the appeal to the experts.

Recommended answer

If you need to check for the existence or absence of something in a list, and you don't have the memory capacity to store the list or storing it is otherwise infeasible, use a Bloom filter (see "Bloom filter" on Wikipedia).

A Bloom filter will give you a definite negative if an item is not encoded in the Bloom bits, and a probable positive for its existence (the probability is tunable).
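To make the idea concrete, here is a minimal Bloom filter sketch. The class name `BloomFilter`, its parameters, and the choice of `hashlib.blake2b` with double hashing are assumptions for illustration, not something the answer specifies. It assumes UIDs are unsigned 32-bit integers, and it sizes the filter with the standard formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter for unsigned 32-bit integer keys."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        n = max(1, expected_items)
        ln2 = math.log(2)
        # Standard sizing: bits m and hash count k for n items at rate p.
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / ln2 ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, uid):
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.blake2b(uid.to_bytes(4, "little"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1   # force odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, uid):
        for p in self._positions(uid):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, uid):
        # False means "definitely not added"; True means "probably added".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(uid))
```

One caveat for this particular problem: if the filter records "connected" nodes, a false positive would make an unconnected node look connected, so the final report could miss nodes. Whether that is acceptable depends on how exact the check needs to be.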

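[Any hash function that maps a large key space into fewer buckets will have collisions. There is no general-purpose "perfect" hash function; a collision-free mapping can only be built for a fixed set of keys known in advance. Otherwise the best you can aim for is a statistically even hash function, i.e. one with no clumping of values.]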
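One case worth noting is when the full set of UIDs is known before the traversal starts, as it appears to be in the question. Each UID can then be mapped to its rank in the sorted list of UIDs, which is collision-free and needs memory proportional to the number of nodes rather than to the UID range. The sketch below illustrates this under assumptions: the function name `find_unconnected_dense` and the `edges` adjacency mapping are hypothetical, and every root and neighbour is assumed to be a member of N.

```python
from bisect import bisect_left
from collections import deque


def find_unconnected_dense(uids, roots, edges):
    """Return the UIDs not reachable from any root, in ascending order.

    `edges` is a hypothetical dict mapping a UID to its neighbouring UIDs.
    """
    sorted_uids = sorted(set(uids))

    def index_of(uid):
        # Rank of the UID in the sorted list: a collision-free dense index.
        i = bisect_left(sorted_uids, uid)
        if i == len(sorted_uids) or sorted_uids[i] != uid:
            raise KeyError(uid)          # UID is not a member of N
        return i

    seen = bytearray(len(sorted_uids))   # one byte per node, not per possible UID
    queue = deque()
    for root in roots:
        i = index_of(root)
        if not seen[i]:
            seen[i] = 1
            queue.append(root)

    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            j = index_of(neighbour)
            if not seen[j]:
                seen[j] = 1
                queue.append(neighbour)

    return [uid for i, uid in enumerate(sorted_uids) if not seen[i]]
```

Each lookup costs a binary search, O(log N), instead of a direct index, and a gap from 100,000 to 100,000,000 costs no extra memory. In Python specifically, a built-in `set` of visited UIDs would give the same exact answer with less code; the rank-based version is shown because it carries over to languages without a built-in hash set.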

