检查值时,Set如何实际在内部工作? [英] How does Set actually work internal when checking for values?

查看:96
本文介绍了检查值时,Set如何实际在内部工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个非常笼统的基于计算机科学的问题,但是根据有关其工作原理的文献来看,这似乎并不直观.这是一个与语言无关的问题,但与Set数据类型在内部如何工作有关.

This is a very general computer science based question, but one that doesn't seem intuitive based on the literature about how they work. It's a language agnostic question, but relates to how a Set data type works internally.

我已经多次使用它们,建议使用它们来存储唯一值并快速访问它们.假设使用Big-O表示法,则每次访问Set时,其时间和复杂度均为O(1).如果Set包含数千个项目,怎么可能?即使项目是唯一的.

I have used them many times, and it is recommended to use them to store unique values and quickly access them. Supposedly in Big-O notation its time and complexity is O(1) every time the Set is accessed. How is that possibly if the Set may contain thousands of items? Even if the items are unique.

为了在Set中找到一个项目,它仍然必须扫描每个唯一的项目,在Big-O中,时间和复杂度为O(n).我在这里想念什么吗?

In order to find an item in the Set, it still has to scan each and every unique item, which in Big-O is O(n) in time and complexity. Is there anything I'm missing here?

在此先感谢您的帮助!最彻底的答案会得到投票!

Thanks in advance for your help! Most thorough answer gets the up-vote!

推荐答案

Set是一种更通用的对象的示例,统称为HashedCollections.它们使用某种HashTable来实际存储和检索其元素.

A Set is an example of a more general kind of objects collectively known as HashedCollections. These use some sort of HashTable to actually store and retrieve their elements.

给定任何element,这些表为其计算一个整数值,称为hash.有几种众所周知的技术来定义元素及其hash值之间的映射.有些是本征,因为hash不依赖于element的属性,该属性可能会改变,因此hashelement.其他的是外在的,因为它们可能取决于属性.但是,在后一种情况下,假定从HashedCollection引用特定元素时将不会对其进行修改(否则HashedCollection必须为rehashed).

Given any element, these tables compute an integer value for it named its hash. There are several well-known techniques to define the mapping between elements and their hash values. Some are intrinsic, in the sense that the hash does not depend on the attributes of the element, which might change, and hence the hash remains the same along the life of the element. Others are extrinsic, in the sense that they may depend on the attributes. However, in the latter case, it is supposed that the particular elements will not be modified while they are referenced from a HashedCollection (otherwise the HashedCollection must be rehashed).

element的存储过程如下:

  1. hash是为element计算的.
  2. 该表的index被计算为该表的lengthhash的余数.
  3. 如果已计算出index处的插槽,则将应用某些策略来解决冲突.
  1. The hash is computed for the element.
  2. An index to the table is computed as the remainder of the hash modulo the length of the table.
  3. If the slot at the index so calculated is already taken, some policy is applied to resolve the collision.

第1步应该非常快(例如,hash没有cryptographic强度).

Step 1 is supposed to be really fast (e.g., the hash has not cryptographic strength).

第2步假设(在大多数情况下)表的长度是素数的数字(也使用了2的幂)

Step 2 assumes (in most of the cases) that the length of the table is a prime number (powers of 2 are also used)

第3步基本上可以通过两种不同的方式解决:

Step 3 can be resolved in basically two different ways:

  • 依次扫描表j次,直到index + j的插槽可用,或者
  • 该元素被添加到在给定index( bucket )
  • 上发生冲突的元素集合中
  • The table is sequentially scanned j times until the slot at index + j is free, or
  • The element is added to the collection of elements colliding at the given index (bucket)

此外,如果没有足够的空插槽(这会增加发生碰撞的可能性),则会放大表格并显示rehashed(因为更改了modulo).

In addition, if there aren't enough empty slots (which increases the probability of collision), the table is enlarged and rehashed (because the modulo changed).

如果有足够的可用插槽和索引机制相当随机的分布,则在O(1)中找到所需插槽的可能性非常高.当然,如果有太多的元素碰撞,则平均复杂度不再O(1),但是可以认为通过增加策略(+ rehash)可以减轻这种复杂性.

With sufficient free slots and a fairly random distribution of the indexing mechanism, the probability of finding the desired slot in O(1) is very high. Of course, if too many elements collide, the average complexity is no longer O(1), but this is supposedly mitigated by the growing policy (+ rehash).

检索类似.为了检查element是否属于该集合,需要计算其hashmodulo并将element与目标插槽的内容进行比较.如果比较失败,则搜索将在存储桶中线性进行.

The retrieval is similar. To check whether an element belongs in the collection, its hash and modulo are computed and the element is compared with the contents of the target slot. If the comparison fails, the search proceeds linearly in the bucket.

在没有bucket且增加indexes的情况下,元素的移除有些困难,但是您明白了.

The removal of elements is somewhat more difficult when there is no bucket and instead the indexes are increased, but you get the idea.

如果您真的想了解所有这些工作,请继续进行调试,并在任何Smalltalk方言中调试HashedCollections的基本操作.保证有很多乐趣.

If you really want to see all of this at work, go ahead and debug the basic operations of HashedCollections in any Smalltalk dialect. Lots of fun guaranteed.

这篇关于检查值时,Set如何实际在内部工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆