Getting hash of a list of strings regardless of order


Question

I would like to write a function GetHashCodeOfList() which returns a hash code for a list of strings regardless of order. Two lists containing the same strings should return the same hash code.

ArrayList list1 = new ArrayList();
list1.Add("String1");
list1.Add("String2");
list1.Add("String3");

ArrayList list2 = new ArrayList();
list2.Add("String3");
list2.Add("String2");
list2.Add("String1");

// The two hash codes should be equal, so this should be true.
bool same = GetHashCodeOfList(list1) == GetHashCodeOfList(list2);

I had a few thoughts:

  1. I can first sort the list, then combine the sorted list into one long string and call GetHashCode() on it. However, sorting is a slow operation.

  2. I can get the hash of each individual string in the list (by calling string.GetHashCode()), multiply all the hashes together, and take the result mod UInt32.MaxValue. For example: "String1".GetHashCode() * "String2".GetHashCode() * … mod UInt32.MaxValue. But this results in a numeric overflow.

Does anyone have any thoughts?

Thanks in advance for your help.

Recommended answer

There are various approaches here, falling under two main categories, each typically with its own benefits and disadvantages in terms of effectiveness and performance. It is probably best to choose the simplest algorithm that suits your application, and to use the more complex variants only if your situation requires them.

Note that these examples use EqualityComparer<T>.Default, since that deals with null elements cleanly by treating them as zero. You could do better than zero for null if desired. If T is constrained to struct, elements can never be null, so this is unnecessary. You can also hoist the EqualityComparer<T>.Default lookup out of the function if so desired.
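
As a rough sketch of the two tweaks mentioned here, the comparer lookup can live in a static field and null can be given its own non-zero value. The type name OrderIndependentHash<T>, the field DefaultComparer, and the constant NullHash are all made up for this illustration, and XOR is used only as a stand-in combining step:

```csharp
using System.Collections.Generic;

// Hypothetical helper type; the name is an assumption made for this sketch.
public static class OrderIndependentHash<T>
{
    // Looked up once per closed generic type instead of on every call.
    private static readonly IEqualityComparer<T> DefaultComparer =
        EqualityComparer<T>.Default;

    // Arbitrary non-zero stand-in for null elements (an assumption; any constant works).
    private const int NullHash = 0x5bd1e995;

    public static int Xor(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source)
        {
            // For value types the null test is always false.
            hash ^= element == null ? NullHash : DefaultComparer.GetHashCode(element);
        }
        return hash;
    }
}
```

Called as OrderIndependentHash<string>.Xor(list), this gives a list containing a single null a different hash from an empty list, which the zero-for-null behaviour cannot do.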

Unordered operations

If you combine the hash codes of the individual entries using operations that are commutative, you get the same end result regardless of order.

There are several obvious options on numbers:

XOR

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = hash ^ EqualityComparer<T>.Default.GetHashCode(element);
    }
    return hash;
}

One downside of this is that the hash for { "x", "x" } is the same as the hash for { "y", "y" }, because XORing a value with itself always gives zero. If that's not a problem for your situation, though, it's probably the simplest solution.
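
The collision can be checked directly. The following is only an illustrative sketch; XorCollisionDemo, XorHash, and PairsCollide are names invented here, and XorHash repeats the XOR combination from the example above:

```csharp
using System.Collections.Generic;

public static class XorCollisionDemo
{
    // Same XOR combination as the example above.
    public static int XorHash<T>(IEnumerable<T> source)
    {
        int hash = 0;
        foreach (T element in source)
        {
            hash ^= EqualityComparer<T>.Default.GetHashCode(element);
        }
        return hash;
    }

    // True for any two values, because h ^ h == 0 for every h.
    public static bool PairsCollide<T>(T a, T b)
    {
        return XorHash(new[] { a, a }) == XorHash(new[] { b, b });
    }
}
```

XorCollisionDemo.PairsCollide("x", "y") returns true: both lists hash to 0, which is also the hash of an empty list. More generally, any element that appears an even number of times drops out of the result.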

Addition

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source)
    {
        hash = unchecked (hash + 
            EqualityComparer<T>.Default.GetHashCode(element));
    }
    return hash;
}

Overflow is fine here, hence the explicit unchecked context.
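
A small sketch of what the unchecked context changes. UncheckedDemo and its two methods are hypothetical names used only for this illustration:

```csharp
using System;

public static class UncheckedDemo
{
    // Wraps around silently on overflow, which is acceptable for a hash code.
    public static int AddUnchecked(int a, int b)
    {
        return unchecked(a + b);
    }

    // Returns true if the same addition throws when overflow checking is on.
    public static bool AddCheckedThrows(int a, int b)
    {
        try
        {
            _ = checked(a + b);
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```

UncheckedDemo.AddUnchecked(int.MaxValue, 1) wraps to int.MinValue, while UncheckedDemo.AddCheckedThrows(int.MaxValue, 1) returns true. Writing unchecked explicitly keeps the hash function working even if the project is compiled with overflow checking enabled.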

There are still some nasty cases (e.g. {1, -1} and {2, -2}), but it's more likely to be okay, particularly with strings. In the case of lists that may contain such integers, you could always implement a custom hashing function (perhaps one that takes the index of recurrence of the specific value as a parameter and returns a unique hash code accordingly).

Here is an example of such an algorithm that gets around the aforementioned problem in a fairly efficient manner. It also has the benefit of greatly improving the distribution of the hash codes generated (see the article linked at the end for some explanation). A mathematical/statistical analysis of exactly how this algorithm produces "better" hash codes would be quite advanced, but testing it across a large range of input values and plotting the results should verify it well enough.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    int curHash;
    int bitOffset;
    // Stores the number of occurrences so far of each value.
    // Dictionary keys cannot be null, so this variant assumes non-null elements.
    var valueCounts = new Dictionary<T, int>();

    foreach (T element in source)
    {
        curHash = EqualityComparer<T>.Default.GetHashCode(element);
        // bitOffset is 0 for the first occurrence of a value, 1 for the
        // second, and so on (TryGetValue sets it to 0 when not found).
        valueCounts.TryGetValue(element, out bitOffset);
        valueCounts[element] = bitOffset + 1;

        // The current hash code is rotated one bit further left on each
        // successive recurrence of a certain value to widen the distribution.
        // The uint casts make the right shift logical, so the bits wrap
        // around instead of being sign-extended. C# masks shift counts to
        // five bits, so an offset of 0, or of 32 and above, still works.
        // 37 is an arbitrary low prime number that helps the
        // algorithm to smooth out the distribution.
        unchecked
        {
            uint bits = (uint)curHash;
            int rotated = (int)((bits << bitOffset) | (bits >> (32 - bitOffset)));
            hash += rotated * 37;
        }
    }

    return hash;
}

Multiplication

Multiplication has few, if any, benefits over addition: with small numbers and a mix of positive and negative numbers, it may lead to a better distribution of hash bits. On the negative side, an element whose hash is 1 becomes a useless entry contributing nothing, and any element whose hash is zero makes the whole result zero. You can special-case zero so that it does not cause this major flaw.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 17;
    foreach (T element in source)
    {
        int h = EqualityComparer<T>.Default.GetHashCode(element);
        if (h != 0)
            hash = unchecked (hash * h);
    }
    return hash;
}
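
To see both weaknesses in isolation, here is a sketch with the zero check made optional. MultiplicationDemo, ProductHash, and the skipZero parameter are invented for this illustration, and the usage note below assumes that an int hashes to its own value, as it does in current .NET implementations:

```csharp
using System.Collections.Generic;

public static class MultiplicationDemo
{
    // skipZero toggles the special case used in the example above.
    public static int ProductHash<T>(IEnumerable<T> source, bool skipZero)
    {
        int hash = 17;
        foreach (T element in source)
        {
            int h = EqualityComparer<T>.Default.GetHashCode(element);
            if (h != 0 || !skipZero)
                hash = unchecked(hash * h);
        }
        return hash;
    }
}
```

Without the special case, ProductHash(new[] { 0, 5 }, false) collapses to 0. With it, { 0, 5 }, { 1, 5 }, and { 5 } all give 17 * 5 = 85, so the zero check removes the collapse but elements hashing to 0 or 1 still contribute nothing.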

Order first

The other core approach is to enforce some ordering first, then use any hash combination function you like. The ordering itself is immaterial so long as it is consistent.

public static int GetOrderIndependentHashCode<T>(IEnumerable<T> source)
{
    int hash = 0;
    foreach (T element in source.OrderBy(x => x, Comparer<T>.Default))
    {
        // f is any function/code you like returning int
        hash = f(hash, element);
    }
    return hash;
}

This has some significant benefits in that the combining operations possible in f can have significantly better hashing properties (distribution of bits, for example), but this comes at significantly higher cost. The sort is O(n log n), and the required copy of the collection is a memory allocation you can't avoid, given the desire to avoid modifying the original. GetHashCode implementations should normally avoid allocations entirely. One possible implementation of f would be similar to that given in the last example under the Addition section (e.g. any constant number of bit shifts left followed by a multiplication by a prime; you could even use successive primes on each iteration at no extra cost, since they only need to be generated once).
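
One possible shape for f, as a sketch only. The class name OrderFirstHash, the helper Combine, the optional ordering parameter, the seed 17, the rotation by 5 bits, and the prime 37 are all choices made for this illustration rather than anything prescribed above:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class OrderFirstHash
{
    public static int GetOrderIndependentHashCode<T>(
        IEnumerable<T> source, IComparer<T> ordering = null)
    {
        IEqualityComparer<T> hasher = EqualityComparer<T>.Default;
        int hash = 17;
        // OrderBy sorts a copy, so the caller's collection is left untouched.
        foreach (T element in source.OrderBy(x => x, ordering ?? Comparer<T>.Default))
        {
            hash = Combine(hash, hasher.GetHashCode(element));
        }
        return hash;
    }

    // One hypothetical choice of f: rotate left by 5 bits, multiply by a
    // prime, then add the element's hash. Overflow wraps on purpose.
    private static int Combine(int hash, int elementHash)
    {
        unchecked
        {
            uint bits = (uint)hash;
            int rotated = (int)((bits << 5) | (bits >> 27));
            return rotated * 37 + elementHash;
        }
    }
}
```

For strings it is probably worth passing StringComparer.Ordinal as the ordering, so that the sort order agrees with string equality. Because Combine is not commutative, { 1, -1 } and { 2, -2 } no longer collide the way they do under plain addition.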

That said, if you are dealing with cases where you can calculate and cache the hash, and so amortize the cost over many calls to GetHashCode, this approach may yield superior behaviour. The latter approach is also more flexible, since it can avoid the need to call GetHashCode on the elements if it knows their type, and instead use per-byte operations on them to yield an even better hash distribution. Such an approach would likely be of use only in cases where performance has been identified as a significant bottleneck.
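
A sketch of the caching idea, assuming the collection can be treated as immutable once constructed. UnorderedList<T> is a hypothetical wrapper type, and the seed 17 and multiplier 31 are arbitrary choices for this illustration:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical immutable wrapper; the type name is an assumption for this sketch.
public sealed class UnorderedList<T>
{
    private readonly T[] sortedItems;
    private readonly int cachedHash;

    public UnorderedList(IEnumerable<T> source)
    {
        // Sort a private copy once; the contents never change afterwards.
        sortedItems = source.OrderBy(x => x, Comparer<T>.Default).ToArray();

        int hash = 17;
        foreach (T item in sortedItems)
        {
            hash = unchecked(hash * 31 + EqualityComparer<T>.Default.GetHashCode(item));
        }
        cachedHash = hash;
    }

    // Constant time and allocation free: the sorting cost was paid in the constructor.
    public override int GetHashCode()
    {
        return cachedHash;
    }

    public override bool Equals(object obj)
    {
        var other = obj as UnorderedList<T>;
        return other != null
            && sortedItems.SequenceEqual(other.sortedItems, EqualityComparer<T>.Default);
    }
}
```

Two instances built from the same strings in different orders compare equal and return the same hash code, and every GetHashCode call after construction is a field read. This only works because nothing can modify the wrapped items. A mutable collection would have to invalidate the cached value on every change.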

Finally, if you want a reasonably comprehensive and fairly non-mathematical overview of the subject of hash codes and their effectiveness in general, these blog posts would be worthwhile reads, in particular the "Implementing a simple hashing algorithm (pt II)" post.
