无论顺序如何,获取字符串列表的哈希值 [英] Getting hash of a list of strings regardless of order

查看:22
本文介绍了无论顺序如何,获取字符串列表的哈希值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想编写一个函数 GetHashCodeOfList() 返回一个字符串列表的哈希码,而不管顺序.给定 2 个具有相同字符串的列表应该返回相同的哈希码.

I would like to write a function GetHashCodeOfList() which returns a hash-code of a list of strings regardless of order. Given 2 lists with the same strings should return the same hash-code.

ArrayList list1 = new ArrayList()    
list1.Add("String1");
list1.Add("String2");
list1.Add("String3");    

ArrayList list2 = new ArrayList()    
list2.Add("String3");    
list2.Add("String2"); 
list2.Add("String1");

GetHashCodeOfList(list1) = GetHashCodeOfList(list2) //this should be equal.

我有一些想法:

  1. 我可以先对列表进行排序,然后将排序后的列表组合成1个长字符串,然后调用GetHashCode().然而,排序是一个缓慢的操作.

  1. I can first sort the list, then combine the sorted list into 1 long string and then call GetHashCode(). However sorting is a slow operation.

我可以获取列表中每个单独字符串的哈希值(通过调用 string.GetHashCode()),然后将所有哈希值相乘并调用 Mod UInt32.MaxValue.例如:"String1".GetHashCode() * "String2".GetHashCode * ... MOD UInt32.MaxValue.但这会导致数字溢出.

I can get the hash of each individual string (by calling string.GetHashCode()) in the list, then multiplying all hashes and calling Mod UInt32.MaxValue. For example: "String1".GetHashCode() * "String2".GetHashCode * … MOD UInt32.MaxValue. But this results in a number overflow.

有人有什么想法吗?

预先感谢您的帮助.

推荐答案

这里有各种不同的方法,主要分为两大类,在有效性和性能方面,每种方法通常都有自己的优点和缺点.最好为任何应用程序选择最简单的算法,并在任何情况下仅在必要时使用更复杂的变体.

There are various different approaches here the under two main categories, each typically with their own benefits and disadvantages, in terms of effectiveness and performance. It is probably best to choose the simplest algorithm for whatever application and only use the more complex variants if necessary for whatever situation.

请注意,这些示例使用 EqualityComparer<T>.Default 因为这将干净地处理空元素.如果需要,您可以为 null 做得比零更好.如果 T 被约束为 struct ,它也是不必要的.如果需要,您可以从函数中提升 EqualityComparer.Default 查找.

Note that these examples use EqualityComparer<T>.Default since that will deal with null elements cleanly. You could do better than zero for null if desired. If T is constrained to struct it is also unnecessary. You can hoist the EqualityComparer<T>.Default lookup out of the function if so desired.

如果您对可交换的各个条目的哈希码进行操作,那么这将无论顺序如何,都会导致相同的最终结果.

If you use operations on the hashcodes of the individual entries which are commutative then this will lead to the same end result regardless of order.

数字有几个明显的选项:

There are several obvious options on numbers:

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
    }
    return hash;
}

一个缺点是 { "x", "x" } 的散列与 { "y", "y" } 的散列相同.不过,如果这对您的情况来说不是问题,那么这可能是最简单的解决方案.

One downside of that is that the hash for { "x", "x" } is the same as the hash for { "y", "y" }. If that's not a problem for your situation though, it's probably the simplest solution.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = unchecked (hash + 
            EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}

这里溢出很好,因此显式unchecked上下文.

Overflow is fine here, hence the explicit unchecked context.

仍然有一些令人讨厌的情况(例如 {1, -1} 和 {2, -2},但它更有可能没问题,特别是对于字符串.对于可能包含此类整数的列表,您可以始终实现自定义散列函数(可能将特定值的重复索引作为参数并相应地返回唯一的散列代码).

There are still some nasty cases (e.g. {1, -1} and {2, -2}, but it's more likely to be okay, particularly with strings. In the case of lists that may contain such integers, you could always implement a custom hashing function (perhaps one that takes the index of recurrence of the specific value as a parameter and returns a unique hash code accordingly).

以下是此类算法的示例,它以相当有效的方式解决了上述问题.它还具有大大增加生成的散列码分布的好处(有关某些解释,请参阅末尾链接的文章).对该算法究竟如何产生更好"的哈希码进行数学/统计分析将是相当先进的,但在大范围的输入值上测试它并绘制结果应该足以验证它.

Here is an example of such an algorithm that gets around the aforementioned problem in a fairly efficient manner. It also has the benefit of greatly increasing the distribution of the hash codes generated (see the article linked at the end for some explanation). A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a large range of input values and plotting the results should verify it well enough.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    int curHash;
    int bitOffset = 0;
    // Stores number of occurences so far of each value.
    var valueCounts = new Dictionary<T, int>();

    foreach (T element in source)
    {
        curHash = EqualityComparer<T>.Default.GetHashCode(element);
        if (valueCounts.TryGetValue(element, out bitOffset))
            valueCounts[element] = bitOffset + 1;
        else
            valueCounts.Add(element, bitOffset);

        // The current hash code is shifted (with wrapping) one bit
        // further left on each successive recurrence of a certain
        // value to widen the distribution.
        // 37 is an arbitrary low prime number that helps the
        // algorithm to smooth out the distribution.
        hash = unchecked(hash + ((curHash << bitOffset) |
            (curHash >> (32 - bitOffset))) * 37);
    }

    return hash;
}

乘法

与加法相比,这几乎没有什么好处:小数和正负数的混合可能会导致散列位的更好分布.作为抵消这个1"的负数,它变成了一个没有贡献的无用条目,任何零元素都会导致零.您可以将特殊情况设为零,以免造成此重大缺陷.

Multiplication

Which has few if benefits over addition: small numbers and a mix of positive and negative numbers they may lead to a better distribution of hash bits. As a negative to offset this "1" becomes a useless entry contributing nothing and any zero element results in a zero. You can special-case zero not to cause this major flaw.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 17;
    foreach (T element in source)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        if (h != 0)
            hash = unchecked (hash * h);
    }
    return hash;
}

先下单

另一种核心方法是先强制执行一些排序,然后使用您喜欢的任何散列组合函数.只要保持一致,排序本身并不重要.

Order first

The other core approach is to enforce some ordering first, then use any hash combination function you like. The ordering itself is immaterial so long as it is consistent.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
    {
        // f is any function/code you like returning int
        hash = f(hash, element);
    }
    return hash;
}

这有一些显着的好处,因为 f 中可能的组合操作可以具有明显更好的散列属性(例如位分布),但这样做的成本要高得多.排序是 O(n log n) 并且集合的所需副本是您无法避免的内存分配,因为您希望避免修改原始内容.GetHashCode 实现通常应该完全避免分配.f 的一种可能实现类似于加法部分下的最后一个示例中给出的实现(例如,向左移动任意常数数量的位移,然后与素数相乘 - 您甚至可以在上使用连续素数每次迭代无需额外成本,因为它们只需要生成一次).

This has some significant benefits in that the combining operations possible in f can have significantly better hashing properties (distribution of bits for example) but this comes at significantly higher cost. The sort is O(n log n) and the required copy of the collection is a memory allocation you can't avoid given the desire to avoid modifying the original. GetHashCode implementations should normally avoid allocations entirely. One possible implementation of f would be similar to that given in the last example under the Addition section (e.g. any constant number of bit shifts left followed by a multiplication by a prime - you could even use successive primes on each iteration at no extra cost, since they only need be generated once).

也就是说,如果您处理的情况是您可以计算和缓存散列并通过多次调用 GetHashCode 来分摊成本,那么这种方法可能会产生更好的行为.此外,后一种方法更加灵活,因为如果知道元素的类型,它可以避免在元素上使用 GetHashCode 的需要,而是对它们使用按字节操作以产生更好的散列分布.这种方法可能仅在性能被确定为重大瓶颈的情况下才有用.

That said, if you were dealing with cases where you could calculate and cache the hash and amortize the cost over many calls to GetHashCode this approach may yield superior behaviour. Also the latter approach is even more flexible since it can avoid the need to use the GetHashCode on the elements if it knows their type and instead use per byte operations on them to yield even better hash distribution. Such an approach would likely be of use only in cases where the performance was identified as being a significant bottleneck.

最后,如果您想对哈希码的主题及其一般有效性进行相当全面且相当非数学的概述,这些博文 值得一读,尤其是实施简单的散列算法 (pt II) 博文.

Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts would be worthwhile reads, in particular the Implementing a simple hashing algorithm (pt II) post.

这篇关于无论顺序如何,获取字符串列表的哈希值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆