为什么Java中的String.hashCode()有很多冲突? [英] Why does String.hashCode() in Java have many conflicts?

查看:124
本文介绍了为什么Java中的String.hashCode()有很多冲突?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

为什么String.hashcode()有这么多冲突?

Why does String.hashcode() have so many conflicts?

我正在读取jdk1.6中的String.hashCode(),下面是代码

I'm reading the String.hashCode() in jdk1.6, below is the codes

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

这对我来说很混乱,因为它有很多冲突;虽然它不需要是唯一的(我们仍然可以依赖于equals()),但是更少的冲突意味着更好的性能而无需访问链表中的条目。

This looks quite confusing to me because it has so many conflicts; although it's not required to be unique (we can still rely on the equals()), but less conflicts means better performance without visiting the entries in a linked list.

假设我们有两个字符,那么只要我们找到两个匹配下面的方程的字符串,那么我们就会有相同的hashcode()

Suppose we've two characters, then so long as we can find two strings matching below equation, then we will have the same hashcode()

a * 31 +b = c * 31 +d

很容易得出结论(ac)* 31 = db
举一个简单的例子是make ac = 1和db = 31;
所以我写下面的代码进行简单测试

It will be easy to conclude that (a-c) * 31 = d-b take an easy example is make a-c = 1 and d-b = 31; so i wrote below codes for simple test

public void testHash() {
    System.out.println("A:" + (int)'A');
    System.out.println("B:" + (int)'B');
    System.out.println("a:" + (int)'a');

    System.out.println("Aa".hashCode() + "," + "BB".hashCode());
    System.out.println("Ba".hashCode() + "," + "CB".hashCode());
    System.out.println("Ca".hashCode() + "," + "DB".hashCode());
    System.out.println("Da".hashCode() + "," + "EB".hashCode());        
}

它将打印在结果下面,这意味着所有字符串都具有相同的hashcode() ,并且很容易在循环中完成。

it will print below results which means all the strings have the same hashcode(), and it's easy to do it in a loop.

A:65 
B:66
a:97
2112,2112
2143,2143
2174,2174
2205,2205

更糟糕的是,假设我们在字符串中有4个字符,根据算法,假设前2个字符产生a2,第2个2个字符产生b2;
哈希码仍然是 a2 * 31 ^ 2 + b2
因此,a2和b2在2个字符串之间相等,我们将获得更多字符串hashcode()冲突。
这样的例子是AaAa,BBBB等等;
那么我们将有6个字符,8个字符......

even worse, suppose we've 4 characters in the string, according to the algorithm, suppose the first 2 characters produce a2, the 2nd 2 characters produce b2; the hashcode will still be a2 * 31^2 + b2 thus, with a2 and b2 equal between 2 strings, we will get more strings with hashcode() conflict. such examples are "AaAa", "BBBB" and so on; then we will have 6 characters, 8 characters......

假设大多数时候我们在字符串中使用ascii表中的字符在hashmap或hashtable中使用,那么这里选择的素数31肯定太小;

suppose most of the time we use characters in ascii table in a string which will be used in a hashmap or hashtable, then the choosen prime number 31 here is definitely too small;

一个简单的解决方法是使用更大的素数(幸运的是,257是一个素数),可以避免这种冲突。当然,选择一个太大的数字会导致返回的int值溢出,如果字符串很长,但我假设大多数时候用作键的字符串不是那么大?
当然,它仍然可以返回一个很长的值来避免这种情况。

one easy fix is to use a larger prime number (luckily, 257 is a prime number) which can avoid this conflict. of course, choose a too big number will cause returned int value to be overflowed if the string is very long, but i assume most of the time the string used as a key is not that big? of course, it could still return a long value to avoid this.

下面是我修改过的betterhash()版本可以解决这些冲突通过运行下面会打印的代码来轻松
,这有效解决了这个问题。

16802,17028
17059,17285
17316,17542
17573,17799

但是为什么jdk没有解决它?谢谢。

but why jdk does not fix it? thx.

@Test
public void testBetterhash() {
    System.out.println(betterHash("Aa") + "," + betterHash("BB"));      
    System.out.println(betterHash("Ba") + "," + betterHash("CB"));
    System.out.println(betterHash("Ca") + "," + betterHash("DB"));
    System.out.println(betterHash("Da") + "," + betterHash("EB"));
}

public static int betterHash(String s) {
    int h = 0;
    int len = s.length();

    for (int i = 0; i < len; i++) {
        h = 257*h + s.charAt(i);
    }
    return h;
}


推荐答案

我刚刚练习了58,000英语语言单词(在这里),全部小写,第一个字母大写。知道有多少相撞?二:兄弟姐妹和德黑兰(德黑兰的另一种拼写)。

I just hashed 58 thousand English language words (found here), both all-lowercase and also with the first letter capitalized . Know how many collided? Two: "Siblings" and "Teheran" (an alternate spelling of "Tehran").

就像你一样,我拿了一个子域(在我的情况下可能是一个)可能的字符串并分析了它的hashCode冲突率,并发现它是典型的。谁能说你的可选字符串的任意子域是比我的优化更好的选择?

Just like you, I took a subdomain (in my case a likely one) of possible strings and analyzed the hashCode collision rate for it, and found it to be exemplary. Who is to say that your arbitrary subdomain of possible strings is a better choice to optimize for than mine?

编写此类的人必须这样做,因为他们无法预测(也不会优化)其用户将字符串用作键的子域。所以他们选择了一个散列函数,它在整个字符串域上均匀分布。

The people that wrote this class had to do so knowing that they couldn't predict (nor therefore optimize) the subdomain in which their users would be using Strings as keys. So they chose a hash function which distributes evenly over the entire domain of strings.

如果你有兴趣,这是我的代码(它使用番石榴):

If you're interested, here's my code (it uses Guava):

    List<String> words = CharStreams.readLines(new InputStreamReader(StringHashTester.class.getResourceAsStream("corncob_lowercase.txt")));
    Multimap<Integer, String> wordMap = ArrayListMultimap.create();
    for (String word : words) {
        wordMap.put(word.hashCode(), word);
        String capitalizedWord = word.substring(0, 1).toUpperCase() + word.substring(1);
        wordMap.put(capitalizedWord.hashCode(), capitalizedWord);
    }

    Map<Integer, Collection<String>> collisions = Maps.filterValues(wordMap.asMap(), new Predicate<Collection<String>>() {
        public boolean apply(Collection<String> strings) {
            return strings.size() > 1;
        }
    });

    System.out.println("Number of collisions: " + collisions.size());
    for (Collection<String> collision : collisions.values()) {
        System.out.println(collision);
    }



编辑



顺便说一句,如果你很好奇,与你的哈希函数相同的测试有13次冲突,而 String.hashCode 的是1。

这篇关于为什么Java中的String.hashCode()有很多冲突?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆