Why are HashSets of structs with nullable values incredibly slow?


Problem description

I investigated performance degradation and tracked it down to slow HashSets.
I have structs with nullable values that are used as a primary key. For example:

public struct NullableLongWrapper
{
    private readonly long? _value;

    public NullableLongWrapper(long? value)
    {
        _value = value;
    }
}

I noticed that creating a HashSet<NullableLongWrapper> is exceptionally slow.

Here's an example using BenchmarkDotNet: (Install-Package BenchmarkDotNet)

using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<HashSets>();
    }
}

public class Config : ManualConfig
{
    public Config()
    {
        Add(Job.Dry.WithWarmupCount(1).WithLaunchCount(3).WithTargetCount(20));
    }
}

public struct NullableLongWrapper
{
    private readonly long? _value;

    public NullableLongWrapper(long? value)
    {
        _value = value;
    }

    public long? Value => _value;
}

public struct LongWrapper
{
    private readonly long _value;

    public LongWrapper(long value)
    {
        _value = value;
    }

    public long Value => _value;
}

[Config(typeof (Config))]
public class HashSets
{
    private const int ListSize = 1000;

    private readonly List<long?> _nullables;
    private readonly List<long> _longs;
    private readonly List<NullableLongWrapper> _nullableWrappers;
    private readonly List<LongWrapper> _wrappers;

    public HashSets()
    {
        _nullables = Enumerable.Range(1, ListSize).Select(i => (long?) i).ToList();
        _longs = Enumerable.Range(1, ListSize).Select(i => (long) i).ToList();
        _nullableWrappers = Enumerable.Range(1, ListSize).Select(i => new NullableLongWrapper(i)).ToList();
        _wrappers = Enumerable.Range(1, ListSize).Select(i => new LongWrapper(i)).ToList();
    }

    [Benchmark]
    public void Longs() => new HashSet<long>(_longs);

    [Benchmark]
    public void NullableLongs() => new HashSet<long?>(_nullables);

    [Benchmark(Baseline = true)]
    public void Wrappers() => new HashSet<LongWrapper>(_wrappers);

    [Benchmark]
    public void NullableWrappers() => new HashSet<NullableLongWrapper>(_nullableWrappers);
}

Result:

           Method |          Median |   Scaled
----------------- |---------------- |---------
            Longs |      22.8682 us |     0.42
    NullableLongs |      39.0337 us |     0.62
         Wrappers |      62.8877 us |     1.00
 NullableWrappers | 231,993.7278 us | 3,540.34

Using a struct with a Nullable<long> compared to a struct with a long is 3540 times slower!
In my case it made the difference between 800ms and <1ms.

Here is the environment information from BenchmarkDotNet:

OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-5600U CPU 2.60GHz, ProcessorCount=4
Frequency=2536269 ticks, Resolution=394.2799 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1076.0

What is the reason performance is this poor?

Solution

This is happening because every element of _nullableWrappers gets the same hash code from GetHashCode(). All the elements land in the same bucket, so each insertion or lookup degenerates from O(1) to O(N), and building the whole set becomes O(N²).
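
To make the cost of colliding hash codes concrete, here is a minimal sketch that counts how many Equals calls a HashSet<long> makes while it is being filled. CountingComparer and CollisionDemo are hypothetical names made up for this illustration; they are not part of the original benchmark.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: delegates hashing to a supplied function
// and counts how often the set has to fall back to Equals.
public sealed class CountingComparer : IEqualityComparer<long>
{
    private readonly Func<long, int> _hash;

    public CountingComparer(Func<long, int> hash)
    {
        _hash = hash;
    }

    public long EqualsCalls { get; private set; }

    public bool Equals(long x, long y)
    {
        EqualsCalls++;
        return x == y;
    }

    public int GetHashCode(long value)
    {
        return _hash(value);
    }
}

public static class CollisionDemo
{
    // Adds the values 1..count to a set and returns the number of Equals calls needed.
    public static long CountEqualsCalls(int count, Func<long, int> hash)
    {
        var comparer = new CountingComparer(hash);
        var set = new HashSet<long>(comparer);
        for (long i = 1; i <= count; i++)
        {
            set.Add(i);
        }
        return comparer.EqualsCalls;
    }
}
```

Calling CollisionDemo.CountEqualsCalls(1000, v => 42) simulates the situation above, where every element reports the same hash code, while CollisionDemo.CountEqualsCalls(1000, v => v.GetHashCode()) uses a well-distributed hash. The first call should need on the order of N²/2 comparisons and the second close to none.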

You can verify this by printing out all the hash codes.
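
One way to do that check is to count the distinct hash codes rather than eyeball a thousand lines of output. This is only a sketch: DefaultHashWrapper and HashCodeCheck are hypothetical names, and DefaultHashWrapper simply mirrors the shape of NullableLongWrapper without overriding GetHashCode(). The result depends on the runtime, since the default ValueType.GetHashCode() implementation is not the same across .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for NullableLongWrapper: it relies on the
// default ValueType.GetHashCode() implementation.
public struct DefaultHashWrapper
{
    private readonly long? _value;

    public DefaultHashWrapper(long? value)
    {
        _value = value;
    }

    public long? Value => _value;
}

public static class HashCodeCheck
{
    // Returns how many different hash codes the given items produce.
    public static int CountDistinctHashCodes<T>(IEnumerable<T> items)
    {
        return items.Select(item => item.GetHashCode()).Distinct().Count();
    }
}
```

Passing Enumerable.Range(1, 1000).Select(i => new DefaultHashWrapper(i)) to HashCodeCheck.CountDistinctHashCodes and printing the result shows the problem directly: a count of 1 means every element collides, while a count of 1000 means the hash codes are all different.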

If you modify your struct like so:

public struct NullableLongWrapper
{
    private readonly long? _value;

    public NullableLongWrapper(long? value)
    {
        _value = value;
    }

    public override int GetHashCode()
    {
        return _value.GetHashCode();
    }

    public long? Value => _value;
}

it works much more quickly.
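
If you override GetHashCode(), it is usually worth overriding Equals as well and implementing IEquatable<T>, so that the set does not have to box the struct or fall back to the default ValueType.Equals. The following is a sketch of that idea, not something from the question; the type name EquatableNullableLongWrapper is made up for the example.

```csharp
using System;

// Hypothetical variant of NullableLongWrapper with explicit equality members.
public struct EquatableNullableLongWrapper : IEquatable<EquatableNullableLongWrapper>
{
    private readonly long? _value;

    public EquatableNullableLongWrapper(long? value)
    {
        _value = value;
    }

    public long? Value => _value;

    // Strongly typed equality: two wrappers are equal when their values are
    // equal, and two wrappers holding null are also equal.
    public bool Equals(EquatableNullableLongWrapper other)
    {
        return _value == other._value;
    }

    public override bool Equals(object obj)
    {
        return obj is EquatableNullableLongWrapper
            && Equals((EquatableNullableLongWrapper)obj);
    }

    // Delegates to Nullable<long>.GetHashCode(), which returns 0 for null
    // and the hash code of the underlying long otherwise.
    public override int GetHashCode()
    {
        return _value.GetHashCode();
    }
}
```

With this in place, HashSet<EquatableNullableLongWrapper> uses the typed Equals through the default equality comparer, and equal values always produce equal hash codes.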

Now, the obvious question is WHY the hash code of every NullableLongWrapper is the same.

The answer to that is discussed in this thread. However, it doesn't quite answer the question, since Hans' answer revolves around the struct having TWO fields from which to choose when computing the hash code - but in this code, there's only one field to choose from - and it's a value type (a struct).

However, the moral of this story is: Never rely on the default GetHashCode() for value types!


Addendum

I thought that perhaps what was happening was related to Hans' answer in the thread I linked (maybe it was taking the value of the first field, the bool, in the Nullable<T> struct), and my experiments indicate that it may be related, but it's complicated:

Consider this code and its output:

using System;

public class Program
{
    static void Main()
    {
        var a = new Test {A = 0, B = 0};
        var b = new Test {A = 1, B = 0};
        var c = new Test {A = 0, B = 1};
        var d = new Test {A = 0, B = 2};
        var e = new Test {A = 0, B = 3};

        Console.WriteLine(a.GetHashCode());
        Console.WriteLine(b.GetHashCode());
        Console.WriteLine(c.GetHashCode());
        Console.WriteLine(d.GetHashCode());
        Console.WriteLine(e.GetHashCode());
    }
}

public struct Test
{
    public int A;
    public int B;
}

Output:

346948956
346948957
346948957
346948958
346948959

Note how the second and third hash codes (for 1/0 and 0/1) are the same, but the others are all different. I find this strange because clearly changing A changes the hash code, as does changing B, but given two values X and Y, the same hash code is generated for A=X, B=Y and A=Y, B=X.

(That sounds like some XOR stuff is happening behind the scenes, but that's a guess.)
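
To show why an XOR-style combination would produce exactly this pattern, here is a purely illustrative sketch. It is NOT the CLR's actual algorithm; XorHashDemo and both method names are hypothetical, and the multiplier 397 is just a conventional choice.

```csharp
public static class XorHashDemo
{
    // Symmetric: swapping a and b gives the same result,
    // so (1, 0) and (0, 1) collide.
    public static int SymmetricCombine(int a, int b)
    {
        return a ^ b;
    }

    // Order-sensitive: the first value is scaled before mixing,
    // so (1, 0) and (0, 1) no longer collide.
    public static int OrderSensitiveCombine(int a, int b)
    {
        unchecked
        {
            return (a * 397) ^ b;
        }
    }
}
```

Any symmetric combination behaves like SymmetricCombine here: XorHashDemo.SymmetricCombine(1, 0) and XorHashDemo.SymmetricCombine(0, 1) are equal, while changing either argument on its own still changes the result. Plain addition would show the same symmetry, so the output above cannot distinguish between the two.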

Incidentally, this behaviour where BOTH fields can be shown to contribute to the hash code proves that the comment in the reference source for ValueType.GetHashCode() is inaccurate or wrong:

Action: Our algorithm for returning the hashcode is a little bit complex. We look for the first non-static field and get it's hashcode. If the type has no non-static fields, we return the hashcode of the type. We can't take the hashcode of a static member because if that member is of the same type as the original type, we'll end up in an infinite loop.

If that comment were true, then four of the five hash codes in the example above would be the same, since A has the same value, 0, for all of those. (That assumes A is the first field, but you get the same results if you swap the values around: both fields clearly contribute to the hash code.)

Then I tried changing the first field to be a bool:

using System;

public class Program
{
    static void Main()
    {
        var a = new Test {A = false, B = 0};
        var b = new Test {A = true,  B = 0};
        var c = new Test {A = false, B = 1};
        var d = new Test {A = false, B = 2};
        var e = new Test {A = false, B = 3};

        Console.WriteLine(a.GetHashCode());
        Console.WriteLine(b.GetHashCode());
        Console.WriteLine(c.GetHashCode());
        Console.WriteLine(d.GetHashCode());
        Console.WriteLine(e.GetHashCode());
    }
}

public struct Test
{
    public bool A;
    public int  B;
}

Output:

346948956
346948956
346948956
346948956
346948956

Wow! So making the first field a bool makes all the hash codes come out the same, regardless of the values of ANY of the fields!

This still looks like some kind of bug to me.
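
Whatever the cause, the workaround is the same as for NullableLongWrapper: compute the hash code yourself. Below is a minimal sketch for the bool/int struct; HashedTest is a hypothetical name, and the multiplier 397 is just a conventional choice, not something the framework requires.

```csharp
using System;

// Hypothetical version of the Test struct that does not rely on
// the default ValueType.GetHashCode().
public struct HashedTest : IEquatable<HashedTest>
{
    public bool A;
    public int B;

    public bool Equals(HashedTest other)
    {
        return A == other.A && B == other.B;
    }

    public override bool Equals(object obj)
    {
        return obj is HashedTest && Equals((HashedTest)obj);
    }

    // Mixes both fields so that changing either one changes the hash code.
    public override int GetHashCode()
    {
        unchecked
        {
            return ((A ? 1 : 0) * 397) ^ B;
        }
    }
}
```

With this override, the five values from the example above (false/0, true/0, false/1, false/2, false/3) all get different hash codes.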

The bug has been fixed in .NET 4, but only for Nullable. Custom types still yield the bad behavior.
