如何GetHash code#字符串的C()来实现? [英] How is GetHashCode() of C# string implemented?

查看:422
本文介绍了如何GetHash code#字符串的C()来实现?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我只是好奇,因为我想这会对性能产生影响。是否考虑完整的字符串?如果是的话,这将是长长的一串慢。如果只考虑字符串的一部分,这将有糟糕的表现(例如,如果只考虑字符串的开头,就会产生不好的性能,如果HashSet的主要含有具有相同的字符串。

I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.

推荐答案

一定要得到的当你有这样的问题,参考源源$ C ​​$ C 。还有很多更给它比你可以从反编译器看到的。挑选一个符合你的preferred .NET的目标,该方法已经改变版本之间有很多。我就重现了.NET 4.5的版本在这里,从Source.NET 4.5 \ 4.6.0.0 \网络检索\ CLR的\ src \ BCL \ SYSTEM \ String.cs \ 604718 \ String.cs

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

        public override int GetHashCode() { 

#if FEATURE_RANDOMIZED_STRING_HASHING
            if(HashHelpers.s_UseRandomizedStringHashing)
            { 
                return InternalMarvin32HashString(this, this.Length, 0);
            } 
#endif // FEATURE_RANDOMIZED_STRING_HASHING 

            unsafe { 
                fixed (char *src = this) {
                    Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                    Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
                    int hash1 = (5381<<16) + 5381; 
#else 
                    int hash1 = 5381;
#endif 
                    int hash2 = hash1;

#if WIN32
                    // 32 bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length; 
                    while (len > 2) 
                    {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2;
                        len  -= 4;
                    } 

                    if (len > 0) 
                    { 
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                    } 
#else
                    int     c;
                    char *s = src;
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c;
                        c = s[1]; 
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c; 
                        s += 2;
                    }
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily.
                    // This is perfectly fine as long as you don't persist the 
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber; 
#endif
                    return hash1 + (hash2 * 1566083941);
                }
            } 
        }

这是比你讨价还价的可能更多,我会注释codeA位:

This is possibly more than you bargained for, I'll annotate the code a bit:

  • 在该#如果条件编译指令调整该code,以不同的.NET对象。该FEATURE_XX标识符别处定义和转功能关闭整个.NET源$ C ​​$ C整体出售。 WIN32定义为当目标是32位版本的框架,64位版本mscorlib.dll中的单独建立,并存储在GAC的不同的子目录。
  • 的s_UseRandomizedStringHashing变量使散列算法,目的是让程序员摆脱困境是做一些不明智喜欢用GetHash code()来生成散列的东西像密码或加密的安全版本。它是由启用在app.exe.config文件中的条目
  • 固定的声明保持索引字符串便宜,避免了常规索引做检查的范围
  • 在第一个断言确保字符串零终止它应该是,要求允许在循环
  • 优化
  • 第二个断言确保串对准一个地址是4的倍数,因为它应该是,保持环高性能要求
  • 循环用手展开,消耗每个循环4个字符的32位版本。该强制转换为int *是一招,存储2个字符(2×16位)的INT(32位)。循环处理一个其长度的字符串不是4。注意,零终止子可以或可以不被包括在散列的倍数后的额外的语句,它不会是,如果长度是偶数。它着眼于的所有的字符串中的字符,回答你的问题
  • 在64位版本的循环是做不同的,手展开除以2。注意,早在嵌入式零终止,因此不会看所有的字符。否则,很罕见。这是pretty的多,我只能猜想,这是与字符串可能是非常大的。但想不到一个实际的例子
  • 在调试code在年底确保没有code在框架曾经发生在散列code是可重复的运行之间的依赖关系。
  • 在散列算法是pretty的标准。该值1566083941是一个神奇的数字,主要是在梅森捻线机
  • The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
  • The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
  • The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
  • The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
  • The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
  • The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
  • The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
  • The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
  • The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

这篇关于如何GetHash code#字符串的C()来实现?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆