Perfect/ideal hash to isolate anagrams


Question


In an effort to accelerate the fast-out behaviour when testing strings for anagrams, I came up with a prime-based hashing scheme -- although it looks like I wasn't the first.

The basic idea is to map letters to prime numbers, and to compute the product of these primes. Any rearrangement of the letters will have the same product, and if the result can be arbitrarily large then no combination of other letters can produce the same result.

I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
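
A minimal C sketch of the scheme as described (C because that's what the question asks for). The function name, the frequency ranking (ETAOINSHRDLCUMWFGYPBVKJXQZ, a common but assumed ordering) and the letters-only input restriction are all illustrative assumptions:

    #include <stdint.h>
    #include <ctype.h>

    /* Most frequent letters get the smallest primes; indexed by 'a'..'z'.
     * The frequency order ETAOINSHRDLCUMWFGYPBVKJXQZ is an assumption. */
    static const uint64_t letter_prime[26] = {
        /*  a   b   c   d   e   f   g   h   i   j   k   l   m */
             5, 71, 37, 29,  2, 53, 59, 19, 11, 83, 79, 31, 43,
        /*  n   o   p   q   r   s   t   u   v   w   x   y   z */
            13,  7, 67, 97, 23, 17,  3, 41, 73, 47, 89, 61, 101
    };

    /* Product of the primes for each letter of `word` (ASCII letters only).
     * Returns 0 if the product would overflow 64 bits; a genuine product
     * is never 0, so 0 doubles as a "no longer perfect" sentinel. */
    uint64_t anagram_hash(const char *word)
    {
        uint64_t h = 1;
        for (; *word; word++) {
            uint64_t p = letter_prime[tolower((unsigned char)*word) - 'a'];
            if (h > UINT64_MAX / p)
                return 0;   /* overflow: fall back to a slower check */
            h *= p;
        }
        return h;
    }

Two words are definitely anagrams when both calls return the same nonzero value. Note that with this table the largest factor is 101 rather than the 103 quoted below, so the original mapping was presumably slightly different.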

What's notable is that it doesn't fill the coding space very efficiently before overflowing: no result will have any prime factor greater than 103, so most possible values are never produced, and the spacing of the small primes is fixed and not necessarily a great match to letter frequency.

Now I'm wondering if there's something substantially better than this. Something that covers more results with perfect hashes and has strong distribution in the remaining cases.

The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter frequency will obviously be enormously biased because of the range constraints applied to each position (e.g., the likelihood of a sorted array starting with z is substantially lower than that of a sorted array ending with a z).

That sounds like a whole lot of work, though -- and I can't see it guaranteeing a good distribution in the overflow case.

Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing has started. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?

So that's:

  • A perfect hash for as much real-world input as possible (for some sensible number of bits).
  • A strong hash for remaining cases, with a means of distinguishing the two cases.
  • Easy to calculate.

English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.

C code preferred because I understand it.

Solution

If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach I described here (http://stackoverflow.com/a/18174379/2272754). This makes collision detection unnecessary, but it does limit your word size depending on the size of the alphabet and your available space.
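
That bound is consistent with a stars-and-bars style encoding: write each letter's count as that many 1-bits followed by a 0-bit separator, so a k-letter word over an m-letter alphabet takes k + m bits, i.e. k can be at most n - m. A C sketch under that reading (the function name is mine, and this may differ in detail from the linked answer):

    #include <stdint.h>
    #include <ctype.h>

    /* Encode the letter multiset of `word`: for each letter a..z, its
     * count in unary 1-bits, then a 0 separator.  k letters need k + 26
     * bits, so a 64-bit value is a perfect anagram hash for words of up
     * to 64 - 26 = 38 letters.  Returns 0 when the word doesn't fit
     * (the empty string also encodes to 0). */
    uint64_t anagram_code(const char *word)
    {
        unsigned count[26] = {0}, total = 0;
        for (; *word; word++, total++)
            count[tolower((unsigned char)*word) - 'a']++;

        if (total > 64 - 26)
            return 0;                 /* too long for an exact encoding */

        uint64_t code = 0;
        for (int i = 0; i < 26; i++) {
            code = (code << count[i]) | ((1ull << count[i]) - 1); /* 1-bits */
            code <<= 1;                                           /* separator */
        }
        return code;
    }

Because the 0 separators make the bit string self-delimiting, distinct letter multisets that fit can never collide, and any permutation of the same letters produces the identical code.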

To allow words of any length, I would use n-1 bits to do that hash for words up to (n-m-1) characters in length, and save the last bit to signal that the word exceeds that length. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would have to do collision detection any time you get multiple words in those buckets. Since in a real-world application the majority of the words will occupy the shorter word lengths, you'll drastically cut the collision detection needed for the longer words.
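
Putting the two tiers together might look like the sketch below, reusing letter_prime and anagram_code from the earlier sketches. The fallback here is simply the question's prime product allowed to wrap modulo 2^64; any stronger 63-bit hash could be substituted:

    #include <stdint.h>
    #include <string.h>
    #include <ctype.h>

    #define ANAGRAM_LONG (1ull << 63)  /* top bit: exact encoding didn't fit */

    /* Fallback for long words: the prime product left to wrap.  Not
     * perfect, so equal flagged values still need collision detection. */
    static uint64_t wrapping_prime_hash(const char *word)
    {
        uint64_t h = 1;
        for (; *word; word++)
            h *= letter_prime[tolower((unsigned char)*word) - 'a'];
        return h;
    }

    /* Top bit clear: low 63 bits hold the exact (perfect) code, good for
     * words up to 63 - 26 = 37 letters, matching (n-m-1) with n = 64.
     * Top bit set: low 63 bits hold the fallback hash. */
    uint64_t combined_hash(const char *word)
    {
        if (strlen(word) <= 63 - 26)
            return anagram_code(word);   /* uses <= 63 bits: top bit clear */
        return ANAGRAM_LONG | (wrapping_prime_hash(word) >> 1);
    }

Since the overwhelming majority of English words are 37 letters or fewer, almost all comparisons resolve with a definitive equality test, and only hashes with the top bit set ever reach the slower confirmation path.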
