HashSet for byte arrays

Question

I need a HashSet for byte arrays in order to check if a given byte array exists in the collection. But it seems like this doesn't work for byte arrays (or perhaps any array).

Here is my test code:

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };

    HashSet<byte[]> set = new HashSet<byte[]>();
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString(); // returns 2 instead of the expected 1
}

Is there a way to make a HashSet for byte arrays?

Answer

Construct the HashSet with an IEqualityComparer<byte[]>. You don't want to go through an interface here. While byte[] does in fact implement interfaces such as IEnumerable<byte> and IList<byte>, using them is a bad idea because of the overhead involved. You don't rely on the fact that string implements IEnumerable<char> much at all, so don't for byte[] either.

using System.Collections.Generic;

public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] a, byte[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null || a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] a)
    {
        // Rotate-and-xor so every byte of the array contributes to the hash.
        uint b = 0;
        for (int i = 0; i < a.Length; i++)
            b = ((b << 23) | (b >> 9)) ^ a[i];
        return unchecked((int)b);
    }
}

void test()
{
    byte[] b1 = new byte[] { 1, 2, 3 };
    byte[] b2 = new byte[] { 1, 2, 3 };

    HashSet<byte[]> set = new HashSet<byte[]>(new ByteArrayComparer());
    set.Add(b1);
    set.Add(b2);
    Text = set.Count.ToString(); // now returns 1
}

https://msdn.microsoft.com/en-us/library/bb359100(v=vs.110).aspx
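
To directly answer the original membership question, here is a minimal usage sketch (my example, assuming the ByteArrayComparer above): once the set is constructed with the comparer, Contains does a value-based lookup, so a freshly allocated array with the same bytes is found.

using System;
using System.Collections.Generic;

class ContainsDemo
{
    static void Main()
    {
        var set = new HashSet<byte[]>(new ByteArrayComparer())
        {
            new byte[] { 1, 2, 3 },
            new byte[] { 4, 5, 6 }
        };

        // A different array instance with the same contents is reported as present,
        // because equality is now defined by the comparer, not by reference.
        byte[] probe = new byte[] { 1, 2, 3 };
        Console.WriteLine(set.Contains(probe)); // True
    }
}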

If you were to use the answers in the proposed duplicate question, you would end up with one function call and one array-bounds check per byte processed. You don't want that. Expressed in the simplest way, as above, the JIT will inline the fetches, then notice that the bounds checks cannot fail (arrays can't be resized) and omit them. Only one function call for the entire array. Yay.
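
For contrast, here is a hypothetical sketch (mine, not from the original answer) of the interface-based approach being warned against: viewing the array as IList<byte> or IEnumerable<byte> still gives correct results, but every element access becomes an interface or enumerator call that the JIT cannot reduce to a plain inlined array fetch.

using System.Collections.Generic;

// Hypothetical "slow" comparer: correct results, but per-byte call overhead.
public class InterfaceByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        IList<byte> a = x, b = y;            // view the arrays through IList<byte>
        if (a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
            if (a[i] != b[i]) return false;  // interface indexer call per byte
        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        int hash = 17;
        foreach (byte item in (IEnumerable<byte>)obj)  // enumerator calls per byte
            hash = hash * 31 + item;
        return hash;
    }
}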

Lists tend to have only a few elements compared to a byte array, so a dirt-simple hash function such as foreach (var item in list) hashcode = hashcode * 5 + item.GetHashCode(); is often good enough for them; if you use that kind of hash function for byte arrays you will have problems. The multiply-by-a-small-odd-number trick ends up being rather biased too quickly for comfort here. The particular hash function given above is probably not optimal, but we have run tests on this family and it performs quite well with three million entries. The multiply-by-odd approach was getting into trouble too quickly because of numerous collisions between arrays that were only two bytes long/different. If you avoid the degenerate numbers, this family has no collisions in two bytes, and most members of it have no collisions in three bytes.
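
A toy illustration of that two-byte collision problem (my example, not from the original answer): with hash = hash * 5 + b, any two arrays whose bytes satisfy 5*x + y == 5*x' + y' hash identically, so collisions appear as soon as arrays differ in just two byte positions.

using System;

class NaiveHashDemo
{
    // The multiply-by-a-small-odd-number hash the answer warns about.
    static int NaiveHash(byte[] data)
    {
        int hash = 0;
        foreach (byte b in data)
            hash = hash * 5 + b;
        return hash;
    }

    static void Main()
    {
        // 5*0 + 5 == 5*1 + 0, so these two short arrays collide immediately.
        Console.WriteLine(NaiveHash(new byte[] { 0, 5 })); // 5
        Console.WriteLine(NaiveHash(new byte[] { 1, 0 })); // 5
    }
}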

Considering actual use cases: by far the two most likely things here are byte strings and actual files being checked for sameness. In either case, taking a hash code of only the first few bytes is most likely a bad idea. String's hash code uses the whole string, so byte strings should do the same, and most files being checked for duplication don't have a unique prefix in the first few bytes. For N entries, if you start getting hash collisions at around the square root of N, you might as well have walked the entire array when generating the hash code, never mind the fact that compares are slower than hashes.
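
As a sketch of the file-sameness use case (my example; the paths are hypothetical and would normally come from something like Directory.EnumerateFiles), the same comparer lets a HashSet<byte[]> of whole file contents flag duplicates, with every byte contributing to the hash:

using System;
using System.Collections.Generic;
using System.IO;

class DuplicateFileDemo
{
    static void Main()
    {
        var seen = new HashSet<byte[]>(new ByteArrayComparer());
        string[] paths = { "a.bin", "b.bin", "c.bin" };   // hypothetical file names

        foreach (string path in paths)
        {
            byte[] contents = File.ReadAllBytes(path);
            // HashSet<T>.Add returns false when an equal byte array is already present.
            if (!seen.Add(contents))
                Console.WriteLine($"{path} duplicates an earlier file's contents.");
        }
    }
}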
