Why does the 32769th insert fail in std::unordered_set?


Problem Description

I generate a large number of class instances and store them in a std::unordered_set. I have defined a hash function and an equality relation, and so far everything works as it should - I insert 10000 instances with unordered_set::insert, and I can find them with unordered_set::find. All the objects are undamaged, and there is no hint of memory corruption or any other issue.
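
For reference, here is a minimal, hypothetical sketch of that setup. The real class, hash function and equality relation are not shown in the question, so Item and ItemHash below are made-up stand-ins that only illustrate the general pattern:

#include <unordered_set>
#include <functional>
#include <string>
#include <cstddef>

// Hypothetical stand-in for the real class.
struct Item {
    int id;
    std::string name;
    bool operator==(const Item& other) const {
        return id == other.id && name == other.name;
    }
};

struct ItemHash {
    std::size_t operator()(const Item& it) const {
        // Any reasonable combination of the members' hashes works here.
        return std::hash<int>()(it.id) ^ (std::hash<std::string>()(it.name) << 1);
    }
};

int main() {
    std::unordered_set<Item, ItemHash> items;
    for (int i = 0; i < 10000; ++i)
        items.insert(Item{i, "object-" + std::to_string(i)});
    // At this size everything behaves: every inserted element can be found again.
    return items.find(Item{42, "object-42"}) != items.end() ? 0 : 1;
}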

However, when I keep inserting, the 32769th insert fails - it doesn't throw, but it returns a pair in which the iterator is == nullptr (0x00000000). insert is defined as:

pair<iterator, bool> insert(const value_type& Val);

and normally, *iterator is the key I inserted, and the bool is true.
If I try to find the object after the error, it is in the set; if I try to insert it again, it tells me it's already there; so the insert itself seems to have worked fine. Only the returned value is pair<nullptr,true> instead of pair<iterator,bool>.
Note that if I hand-fill the iterator and continue in the debugger, the same issue happens again at the first insert after 65536, and then at 131072, and so on (so for 2^15+1, 2^16+1, 2^17+1, ...) - but not at 3*32768+1, etc.
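
For comparison, a small sketch (with int keys standing in for the real class) of what the returned pair is supposed to look like in both the inserted and the already-present case:

#include <unordered_set>
#include <cassert>

int main() {
    std::unordered_set<int> s;     // int keys as a stand-in for the real class

    auto r1 = s.insert(7);
    assert(r1.second);             // true: the key was inserted
    assert(*r1.first == 7);        // the iterator refers to the inserted key

    auto r2 = s.insert(7);
    assert(!r2.second);            // false: key already present, nothing inserted
    assert(*r2.first == 7);        // the iterator refers to the existing element
}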

To me, this looks like some short overflow. Maybe my hashes are really bad and lead to uneven filling of the buckets, and at 32768 it runs out of buckets? I could not find anything more detailed about such a limit when googling, and I don't know enough about balanced trees or whatever this is internally.
Still, the std library code should be able to handle bad hashing; I understand if it gets slow and inefficient, but it shouldn't fail.

Question: Why does the insert fail at 2^15+1, 2^16+1, etc., and how can I avoid it?

This is in Microsoft Visual Studio 2017 V15.7.1 (the latest version as of 2018-05-15). The compiler is set to use C++17 rules, but I doubt that makes any difference.
I cannot paste the complete code as a minimal working example: the object generation is complex, spans multiple classes and methods, and runs to several hundred lines of code; the generated hashes obviously depend on the details of the objects and are not easily reproducible in dummy code.

### Update after one day ###: (I cannot put this in an answer, because the question was put on hold.) After extensive debugging of the standard library (including a lot of head-scratching), @JamesPoag's answer turns out to point at the right thing.
After n inserts, I get:

  n     load_factor  max_load_factor  bucket_count  max_bucket_count
32766   0.999938965  1.00000000       32768         536870911 (=2^29-1)
32767   0.999969482  1.00000000       32768         536870911
32768   1.000000000  1.00000000       32768         536870911
32769   0.500000000  1.00000000       65536         536870911

Not surprisingly, after 32768 inserts the load factor has reached its maximum. The 32769th insert triggers a rehash to a bigger table, inside the internal method _Check_size:

void _Check_size()
        {    // grow table as needed
        if (max_load_factor() < load_factor())
            {    // rehash to bigger table
            size_type _Newsize = bucket_count();
            if (_Newsize < 512)
                _Newsize *= 8;    // multiply by 8
            else if (_Newsize < _Vec.max_size() / 2)
                _Newsize *= 2;    // multiply safely by 2
            _Init(_Newsize);
            _Reinsert();
            }
        }
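
To see why the boundaries fall exactly at 2^15+1, 2^16+1, and so on, here is a small simulation of the quoted growth rule. The initial bucket count of 8 is an assumption about MSVC's default; everything else follows directly from the code above, together with max_load_factor() == 1.0:

#include <cstddef>
#include <iostream>

int main() {
    std::size_t buckets = 8;                      // assumed MSVC starting bucket count
    for (std::size_t n = 1; n <= 70000; ++n) {    // n = size after the n-th insert
        if (n > buckets) {                        // load_factor would exceed 1.0
            buckets = (buckets < 512) ? buckets * 8 : buckets * 2;
            std::cout << "insert #" << n << " grows the table to "
                      << buckets << " buckets\n";
        }
    }
    // Prints growth at inserts 9, 65, 513, 1025, ..., 32769, 65537 -
    // consistent with the bucket_count jump from 32768 to 65536 in the table above.
}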

At the end, _Reinsert() is called; it fills all 32769 keys into the new buckets and sets all the _next and _prev pointers accordingly. That works fine.
However, the code that calls those two looks like this (_Plist is my set's name; this code gets generated from a template):

_Insert_bucket(_Plist, _Where, _Bucket);

_TRY_BEGIN
_Check_size();
_CATCH_ALL
erase(_Make_iter(_Plist));
_RERAISE;
_CATCH_END

return (_Pairib(_Make_iter(_Plist), true));
}

The critical point is in the last line - _Plist is used to build the returned pair, but it holds a now-dead pointer to _next, because all the buckets' addresses were rebuilt in _Check_size(), a few lines earlier. I think this is an error in the std library - here it would need to find _Plist in the new table, where it looks the same but has a valid _next pointer.

An easy 'fix' (verified to work) is to expand the set right before the critical insert:
if (mySet.size() == mySet.bucket_count()) mySet.rehash(mySet.bucket_count() * 2);
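
As a usage sketch (with int keys standing in for the real objects), the workaround goes immediately before each insert, so that any table growth happens before insert() builds its return value:

#include <unordered_set>

int main() {
    std::unordered_set<int> mySet;    // stand-in for the real set of class instances

    for (int k = 0; k < 100000; ++k) {
        // Grow the table up front, so the insert below never triggers the
        // internal rehash that produced the bad returned iterator.
        if (mySet.size() == mySet.bucket_count())
            mySet.rehash(mySet.bucket_count() * 2);

        auto result = mySet.insert(k);
        // result.first should now be usable even at sizes 32769, 65537, ...
        (void)result;
    }
}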

### Further Update: ### I have been trying extensively (16+ hours) to produce a minimal piece of code that reproduces the issue, but I have not managed to yet. I'll try to log the actual calculated hashes for the existing large code.
One thing I did find is that one hash value of one of the keys changed (unintentionally) between being inserted and being rehashed. This might be the root cause; if I move the rehashing outside of the insert, the issue is gone.
I am not sure whether there is a rule that hashes have to be constant, but it probably makes sense - how else could you find the key again?
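
For illustration, a small hypothetical sketch of why a hash that changes after insertion breaks lookups. The key below deliberately hashes data it only points to, so its hash can change while it is stored in the set:

#include <unordered_set>
#include <functional>
#include <iostream>
#include <cstddef>

// Hypothetical key whose hash depends on data it merely points to.
struct Key {
    int* payload;
    bool operator==(const Key& o) const { return *payload == *o.payload; }
};

struct KeyHash {
    std::size_t operator()(const Key& k) const { return std::hash<int>()(*k.payload); }
};

int main() {
    int value = 42;
    std::unordered_set<Key, KeyHash> s;
    s.rehash(64);                 // plenty of buckets, so distinct hashes rarely collide
    s.insert(Key{&value});

    value = 43;                   // the stored key's hash silently changes

    int probe = 43;               // equal to the stored key under operator==
    std::cout << std::boolalpha
              << (s.find(Key{&probe}) != s.end()) << '\n';
    // Typically prints "false": find() computes the bucket from the *new* hash,
    // but the element still sits in the bucket chosen from the old one.
    // (With very few buckets the two could coincide by chance.)
}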

Recommended Answer

I plugged some simple code into godbolt.org to see what the output was, but nothing jumped out at me.

I suspect that the value is inserted and the iterator is created, but the insertion exceeds the max_load_factor and triggers a rehash. On rehash, the previous iterators are invalidated. The returned iterator might be zeroed out in this case (or never set) - again, I can't find it in the disassembly.

Check load_factor(), max_load_factor() and bucket_count() before and after the offending insert.
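
A minimal sketch of that check, using int keys as a stand-in for the question's class instances. On MSVC one would expect bucket_count to jump from 32768 to 65536 here; other implementations grow differently:

#include <unordered_set>
#include <iostream>

int main() {
    std::unordered_set<int> s;
    for (int i = 1; i <= 32768; ++i)
        s.insert(i);

    auto dump = [&](const char* when) {
        std::cout << when << ": size=" << s.size()
                  << " load_factor=" << s.load_factor()
                  << " max_load_factor=" << s.max_load_factor()
                  << " bucket_count=" << s.bucket_count() << '\n';
    };

    dump("before insert #32769");
    s.insert(32769);              // the offending insert
    dump("after insert #32769");
}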
