How can I generate a unique int from a unique string?

Question

I have an object with a String that holds a unique id (such as "ocx7gf" or "67hfs8"). I need to supply it with an implementation of int hashCode() which, obviously, will be unique.

How do I convert a string to a unique int in the easiest/fastest way?

Thanks.

Edit: OK. I already know that String.hashCode() is possible, but I have not seen it recommended anywhere. If no other method is recommended either, should I use it, given that my object is in a collection and I need the hash code? Should I concatenate it with another string to make it more successful?

Recommended answer

No, you don't need an implementation that returns a unique value, "obviously" or otherwise; if you did, the majority of hashCode() implementations would be broken.

What you want is a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, just using the hash code of the string itself is best.
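A minimal sketch of that advice in Java, assuming a hypothetical class named Item whose identity is defined entirely by its id string (the class and field names are illustrative and do not come from the question):

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class: identity is defined entirely by the id string.
final class Item {
    private final String id;

    Item(String id) {
        this.id = Objects.requireNonNull(id, "id");
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Item)) return false;
        return id.equals(((Item) other).id);
    }

    // Delegates to String.hashCode(): equal ids always give equal hashes,
    // but distinct ids are NOT guaranteed to give distinct hashes.
    @Override
    public int hashCode() {
        return id.hashCode();
    }

    public static void main(String[] args) {
        Set<Item> items = new HashSet<>();
        items.add(new Item("ocx7gf"));
        items.add(new Item("67hfs8"));
        items.add(new Item("ocx7gf")); // same id as the first, so the set ignores it
        System.out.println(items.size()); // prints 2
    }
}
```

The only hard requirement is that equal objects return equal hash codes; the set still behaves correctly if two different ids happen to share a hash, because equals() makes the final decision.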

With special knowledge of the limits of your id format, it may be possible to customise the hash and get better performance, though false assumptions are more likely to make things worse than better.

Edit: on good spread of bits.

As stated here and in other answers, being completely unique is impossible in general and hash collisions can happen. Hash-based collections know this and can deal with it, but it does hurt performance, so we want collisions to be rare.
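As a small illustration in Java, the strings "Aa" and "BB" are different but share the same String.hashCode(), and a HashMap still stores both correctly. The class and method names below are hypothetical, chosen only for this sketch:

```java
import java.util.HashMap;
import java.util.Map;

final class CollisionDemo {
    // True when two different strings have the same hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(collide("Aa", "BB")); // prints true

        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        // Both entries survive: equals() tells the keys apart inside a bucket.
        System.out.println(map.get("Aa") + " " + map.get("BB")); // prints 1 2
    }
}
```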

Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to, for example, one in the range 0 to 22, and we want as good a distribution as possible within that range too.
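A rough sketch of such a reduction in Java, assuming a hypothetical table with 23 buckets (which gives the 0 to 22 range). Real hash tables differ in how they reduce the hash, so treat this as an illustration rather than how any particular collection works:

```java
final class BucketDemo {
    // Reduces a 32-bit hash to a bucket index in [0, buckets).
    // Math.floorMod keeps the result non-negative even for negative hashes.
    static int bucketIndex(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        int buckets = 23; // hypothetical table size, giving indexes 0..22
        for (String id : new String[] {"ocx7gf", "67hfs8"}) {
            System.out.println(id + " -> bucket " + bucketIndex(id.hashCode(), buckets));
        }
    }
}
```

Whatever the reduction, only part of the information in the 32-bit value survives, which is why a hash that varies in only a few bits can behave badly.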

We also want to balance this against taking so long to compute our hash that it becomes a bottleneck in itself. It is an imperfect balancing act.

A classic example of a bad hash method is one for a coordinate pair of X, Y ints that does:

return X ^ Y;

While this does a perfectly good job of returning 2^32 possible values out of the 4^32 (that is, 2^64) possible inputs, in real-world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on), which all hash to zero, or mirrored pairs ({2, 3} and {3, 2}), which hash to the same number. We are likely better served by:

return ((X << 16) | (X >>> 16)) ^ Y;

Now, there are just as many possible values for which this is dreadful as there are for the former, but it tends to serve better in real-world cases.
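To make the difference concrete, here is a small Java experiment that counts distinct hash values over a 100 x 100 grid of small coordinates. The class and method names are hypothetical, and Integer.rotateLeft(x, 16) is used as the standard-library way to swap the two 16-bit halves of x:

```java
import java.util.HashSet;
import java.util.Set;

final class PointHashDemo {
    static int xorHash(int x, int y) {
        return x ^ y;
    }

    // Rotates x by 16 bits before mixing in y, so (x, y) and (y, x)
    // generally differ and x == y no longer always hashes to zero.
    static int rotatedHash(int x, int y) {
        return Integer.rotateLeft(x, 16) ^ y;
    }

    // Counts distinct hash values over the grid 0 <= x, y < size.
    static int distinct(int size, boolean rotated) {
        Set<Integer> seen = new HashSet<>();
        for (int x = 0; x < size; x++) {
            for (int y = 0; y < size; y++) {
                seen.add(rotated ? rotatedHash(x, y) : xorHash(x, y));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinct(100, false)); // prints 128
        System.out.println(distinct(100, true));  // prints 10000
    }
}
```

For these 10,000 points the plain XOR produces only 128 distinct values, while the rotated version gives every point its own value. Other input distributions can still defeat the rotated version.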

Of course, the job is different depending on whether you are writing a general-purpose class (with no idea what the possible inputs are) or have a better idea of the purpose at hand. For example, if I were using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, the month and the lower digits of the year over the standard one. The writer of Date, though, can't rely on such knowledge and has to try to cater for everyone.
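One way such a custom hash might look in Java is sketched below. It assumes java.time.LocalDate rather than java.util.Date, and the class name and the packing scheme are illustrative choices, not something the answer prescribes:

```java
import java.time.LocalDate;

final class DateOnlyHash {
    // Packs the low two digits of the year, the month and the day
    // into one small, dense int. Dates exactly 100 years apart collide.
    static int hash(LocalDate d) {
        int yy = Math.floorMod(d.getYear(), 100); // 0..99
        int mm = d.getMonthValue() - 1;           // 0..11
        int dd = d.getDayOfMonth() - 1;           // 0..30
        return (yy * 12 + mm) * 31 + dd;
    }

    public static void main(String[] args) {
        System.out.println(hash(LocalDate.of(2024, 1, 1))); // prints 8928
        System.out.println(hash(LocalDate.of(2024, 1, 2))); // prints 8929
    }
}
```

Consecutive days get consecutive values, so the low bits change most often, which suits a table that reduces the hash to a small range. This only pays off if the assumption about the data (dates within a few years of each other) really holds.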

Hence if, for instance, I knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that they always do), then I might use an algorithm that assigns a value from 0 to 35 (the 36 possible values for each character) to each character, and then walks through the string, each time multiplying the current value by 36 and adding the value of the next char.
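The algorithm just described can be sketched in Java as follows. The class name is hypothetical, and the sketch assumes every id consists only of characters that are valid base-36 digits; anything else is rejected:

```java
final class Base36Hash {
    // Maps '0'-'9' to 0..9 and 'a'-'z' (either case) to 10..35.
    static int digitValue(char c) {
        int d = Character.digit(c, 36);
        if (d < 0) {
            throw new IllegalArgumentException("not a base-36 character: " + c);
        }
        return d;
    }

    // Multiplies the running value by 36 and adds the next character's value.
    // int overflow wraps around, which is harmless for a hash code.
    static int hash(String id) {
        int h = 0;
        for (int i = 0; i < id.length(); i++) {
            h = h * 36 + digitValue(id.charAt(i));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("00000z")); // prints 35
        System.out.println(hash("000010")); // prints 36
        System.out.println(hash("ocx7gf") == hash("OCX7GF")); // prints true
    }
}
```

For exactly six characters there are 36^6 = 2,176,782,336 possible ids, which is fewer than the 2^32 values an int can hold, so under this assumption every id maps to a different int (some of them negative after wrap-around). With longer ids, distinct strings can collide again.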

Assuming a good spread in the ids, this would be the way to go, especially if I chose the order so that the least significant digits of my hash correspond to the most frequently changing characters in the id (if such a call could be made), so that the hash survives being reduced to a smaller range well.

However, without knowing the format for sure, I can't make that call with certainty, and I could well be making things worse (a slower algorithm for little or even negative gain in hash quality).

One advantage you have is that, since the string is an ID in itself, presumably no other non-equal object has the same ID, and hence no other properties need to be examined when hashing or comparing. This doesn't always hold for other classes.
