散列浮点值 [英] Hashing floating point values

查看:128
本文介绍了散列浮点值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

最近,我很好奇浮点算法是如何工作的,所以我看了 boost :: hash_value 的源代码。原来是相当复杂 。实际的实现在基数中的每个数字上循环并累积一个散列值。与整数散列函数相比,它涉及更多。

我的问题是:为什么浮点散列算法更复杂?为什么不把浮点值的二进制表示法看作是一个整数?





<$ (& f))); pre $ {code> std :: size_t hash_value(float f)
{
return hash_value(*(reinterpret_cast< int *>(& f)));

我意识到 float 在所有系统上不能保证与 int 大小相同,但是这种事情可以用一些模板元程序来处理,以推导出一个整型与 float 大小相同。那么引入一个完全不同的散列函数的好处是什么呢?特别是对浮点类型的操作呢?

解决方案

为什么你想要哈希浮点值?出于同样的原因,比较浮点值的平等有一些陷阱,哈希他们可以有类似(负面)的后果。

然而,鉴于你真的想做这个,我怀疑boost算法是复杂的,因为当你考虑到非规格化的数字时,不同的位模式可以代表相同的数字(并且应该可能具有相同的散列)。在IEEE 754中,也有正数和负数的 0 值相等,但有不同的位模式。

可能不会在哈希中出现,如果它不会在你的算法中出现,但你仍然需要注意发信号的NaN值。

另外什么会是散列+/-无穷大和/或NaN的含义?具体来说,NaN可以有很多表示,他们都应该导致相同的散列?无限似乎只有两个表示,所以它似乎可以工作了。


Recently, I was curious how hash algorithms for floating points worked, so I looked at the source code for boost::hash_value. It turns out to be fairly complicated. The actual implementation loops over each digit in the radix and accumulates a hash value. Compared to the integer hash functions, it's much more involved.

My question is: why should a floating-point hash algorithm be any more complicated? Why not just hash the binary representation of the floating point value as if it was an integer?

Like:

std::size_t hash_value(float f)
{
  return hash_value(*(reinterpret_cast<int*>(&f)));
}

I realize that float is not guaranteed to be the same size as int on all systems, but that sort of thing could be handled with a few template meta-programs to deduce an integral type that is the same size as float. So what is the advantage of introducing an entirely different hash function that specifically operates on floating point types?

解决方案

Why are you wanting to hash floating point values? For the same reason that comparing floating point values for equality has a number of pitfalls, hashing them can have similar (negative) consequences.

However given that you really do want to do this, I suspect that the boost algorithm is complicated because when you take into account denormalized numbers different bit patterns can represent the same number (and should probably have the same hash). In IEEE 754 there are also both positive and negative 0 values that compare equal but have different bit patterns.

This probably wouldn't come up in the hashing if it wouldn't have come up otherwise in your algorithm but you still need to take care about signaling NaN values.

Additionally what would be the meaning of hashing +/- infinity and/or NaN? Specifically NaN can have many representations, should they all result in the same hash? Infinity seems to have just two representations so it seems like it would work out ok.

这篇关于散列浮点值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆