字符串如何存储在VBA字典结构中? [英] How Strings are stored in a VBA Dictionary structure?

查看:265
本文介绍了字符串如何存储在VBA字典结构中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

由于我目前正在玩大量的字符串(请查看另一个问题:数组和数组列表的VBA内存大小)我使用了一个脚本字典,仅用于具有键控访问的功能。
一切都看起来很好,除了它是一些加载字符串有多慢,它使用了大量的内存。对于长度为128个字符的100,000个字符串的示例,任务管理器在子结尾显示大约295 MB,并且当设置Dictionary = Nothing时,Excel中仍然存在差的12 MB。即使考虑内部Unicode转换的字符串128 * 2 * 100,000给了25.6 MB!有人可以解释这个很大的区别吗?

As I am currently playing with huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist) I used a scripting dictionary just for the feature of the keyed access that it has. Everything was looking fine except that it was some how slow in loading the strings and that it uses a lot of memory. For an example of 100,000 strings of 128 characters in length, the Task manager showed at the end of the sub approximately 295 MB and when setting Dictionary=Nothing a poor 12 MB was remaining in Excel. Even considering internal Unicode conversion of strings 128 * 2 * 100,000 gives 25.6 MB ! Can someone explain this big difference ?

推荐答案

这是我可以在Scripting.Dictionary上找到的所有信息:

Here is all the info I could find on the Scripting.Dictionary:

根据Eric Lippert,谁编写了Scripting.Dictionary,通用字典的实际实现是一种可扩展哈希绑定算法,当表格太满时重新编排。 (从上下文可以清楚,他指的是Scripting.Dictionary)维基百科的哈希表的文章是对相关概念的一个很好的介绍。 (这里是Eric的博客搜索Scripting.Dictionary的搜索,他偶尔会提到它)

According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary, he occasionally mentions it)

基本上,您可以将哈希表视为内存中的大数组。您不必直接通过索引存储您的字符串,而是必须提供一个键(通常是一个字符串)。密钥得到散列,也就是说,将一致的一组算法步骤应用于密钥,将其缩减为哈希表中0和当前最大索引之间的数字。该数字用作将字符串存储到哈希表中的索引。由于每次键值散列时都应用了相同的步骤,所以每次都会产生相同的索引,这意味着如果您通过其键查找字符串,则无需像通常那样搜索数组。

Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and current max index in the Hash Table. That number is used as the index to store your string into the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning if you are looking up a string by its key, there is no need to search through the array as your normally would.

散列函数(将键转换为索引到表中)被设计为尽可能随机,但每一次两个键可以紧缩到相同的索引 - 这被称为碰撞。这是通过链接在一起的链接(或可能是更可搜索的结构)来处理的。所以假设你试图用一个键在哈希表中看一个字符串。关键是散列,你得到一个索引。在该索引中查找数组,如果没有添加具有该键的字符串,则可能是空插槽,或者可能是包含一个或多个字符串的链接列表,该字符串映射到数组中的该索引。

The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.

上面详细介绍的完整的原因是指出,哈希表必须大于存储的数量才能使其有效(除了一些例外,请参阅完美的散列函数)。在哈希表中看到的开销大部分是数组的空白部分,必须在这里才能使哈希表有效。

The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). So much of the overhead you would see in a Hash Table are the empty parts of the array that have to be there to make the hash table efficient.

此外,调整大小哈希表是一项昂贵的操作,因为所有现有的字符串必须被重新映射到新的位置,所以当哈希表的负载因子超过预定义的阈值并且它被调整大小时,它的大小可能会增加一倍,以避免这样做。再次很快。

Additionally, resizing the Hash Table is an expensive operation because the all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might get doubled in size to avoid having to do so again soon.

在每个数组位置保存字符串链的结构的实现也可能对开销产生很大的影响。

The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.

如果我发现别的东西,我会在这里添加...

If I find anything else out, I'll add it here...

这篇关于字符串如何存储在VBA字典结构中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆