在迭代其元素的同时更新集合 [英] Updating a set while iterating over its elements

查看:55
本文介绍了在迭代其元素的同时更新集合的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我尝试在迭代其元素时更新集合时,其行为应该是什么?

When I try to update a set while iterating over its elements, what should be its behavior?

我在各种情况下进行了尝试,并且没有迭代开始迭代后添加的元素,也没有迭代在迭代过程中删除的元素.如果我在迭代过程中删除并放回任何元素,则会考虑该元素.确切的行为是什么?它如何工作?

I tried it over various scenarios and it does not iterate over elements added after iteration is started and also the elements removed during iteration. If I remove and put back any element during iteration, that element is being considered. What's the exact behavior and how does it work?

这将打印字符串的所有排列:

This prints all the permutations of a string:

def permutations(s):
    ans = []
    def helper(created, remaining):
        if len(created) == len(s):
            ans.append(''.join(created))
            return
        for ch in remaining:
            remaining.remove(ch)
            created.append(ch)
            helper(created, remaining)
            remaining.add(ch)
            created.pop()
    helper([], set(s))
    return ans

这里的行为是无法预测的,有时会打印 e ,而有时却不会:

Here the behavior is unpredictable, sometimes e is printed while sometimes it's not:

ab = set(['b','c','d'])
x = True
for ch in ab:
    if x:
        ab.remove('c')
        ab.add('e')
        x = False
    print(ch)

在这里,我总是只看到一次'c'.即使第一个字符为'c':

Here I always see 'c' only once. Even when first character is 'c':

ab = set(['b','c','d'])
x = True
for ch in ab:
    if x:
        ab.remove('c')
        ab.add('c')
        x = False
    print(ch)

以及实现上述功能相同目的的另一种方法:

And an alternate way to achieve the same objective of the above function:

def permwdups(s):
    ans = []
    def helper(created, remaining):
        if len(created) == len(s):
            ans.append(''.join(created))
            return
        for ch in remaining:
            if (remaining[ch]<=0):
                continue
            remaining[ch] -=1
            created.append(ch)
            helper(created, remaining)
            remaining[ch] +=1
            created.pop()
    counts = {}
    for i in range(len(s)):
        if s[i] not in counts:
            counts[s[i]] = 1
        else:
            counts[s[i]]+= 1
    helper([], counts)
    print(len(set(ans)))
    return ans

推荐答案

实际上非常简单, set 在CPython中以 hash - item 表格:

It's actually very easy, sets are implemented in CPython as hash - item table:

  hash  |  item  
-----------------
    -   |    -
-----------------
    -   |    -
       ...

CPython使用开放地址,因此并非表中的所有行都被填充,并且在发生冲突的情况下,它使用伪随机"位置确定来基于项目的(被截断的)哈希值来确定元素的位置.在该答案中,我将忽略截断的哈希冲突.

CPython uses open-addressing so not all rows in the table are filled and it determines the position of an element based on the (truncated) hash of the item with a "pseudo-randomized" position determination in case of collisions. I will ignore truncated-hash-collisions in this answer.

我还将忽略哈希截断的细节,只使用整数,所有整数(有些例外)将其哈希映射到实际值:

I'll also ignore the specifics of the hash-truncation and just work with integers, which all (with some exceptions) map their hash to the actual value:

>>> hash(1)
1
>>> hash(2)
2
>>> hash(20)
20

因此,当您创建具有值1、2和3的 set 时,将具有(大致)下表:

So when you create a set with the values 1, 2 and 3 you'll have (roughly) the following table:

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1
-----------------
    2   |    2
-----------------
    3   |    3
       ...

从表的顶部到表的末尾迭代该集合,并且忽略空的行".因此,如果您在不修改该集合的情况下进行迭代,则会得到数字1、2和3:

The set is iterated from the top of the table to the end of the table and empty "rows" are ignored. So if you would iterate over that set without modifying it you would get the numbers 1, 2 and 3:

>>> for item in {1, 2, 3}: print(item)
1
2
3

基本上,迭代器从第0行开始,看到该行为空,然后转到包含项 1 的行1.该项目由迭代器返回.下一次迭代是在第2行中,并在其中返回值,即 2 ,然后转到第3行并返回在此处存储的 3 .在下面的迭代中,迭代器位于第4行,该行为空,因此将其移至第5行,该行也为空,然后移至第6行,直到到达表末尾并抛出 StopIteration 异常,表示迭代器已完成.

Basically the iterator starts in row 0 and sees that the row is empty and goes to row 1 which contains the item 1. This item is returned by the iterator. The next iteration it's in row 2 and returns the value there, which is 2, then it goes to row 3 and returns the 3 that is stored there. The following iteration the iterator is in row 4 which is empty, so it goes to row 5 which is also empty, then to row 6, .... until it reaches the end of the table and throws a StopIteration exception, which signals that the iterator finished.

但是,如果您在迭代集合时更改集合,则会记住该迭代器的当前位置(行).这意味着,如果您在前一行中添加项目,则迭代器不会返回该项目,如果在后一行中添加该项目,则将返回该项目(至少,如果在迭代器出现之前没有再次将其删除).

However if you would change the set while iterating over it the current position (row) where the iterator is is remembered. That means if you add an item in a previous row the iterator won't return it, if it's added in a later row it will be returned (at least if it's not removed again before the iterator is there).

假设,您总是删除当前项目,并向集合中添加一个 item + 1 的整数.像这样:

Assume, you always remove the current item and add an integer which is item + 1 to the set. Something like this:

s = {1}
for item in s: 
    print(item)
    s.discard(item)
    s.add(item+1)

迭代之前的集合如下:

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1
-----------------
    -   |    -
       ...

迭代器将从第0行开始,发现它为空,然后转到包含值 1 的行1,然后将其返回并打印.如果箭头指示迭代器的位置,则在第一次迭代中将如下所示:

The iterator will start in row 0 find it is empty and goes to row 1 which contains the value 1 which is then returned and printed. If the arrow would indicate the position of the iterator it would look like this in the first iteration:

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1      <----------
-----------------
    -   |    -

然后删除 1 ,并添加2:

  hash  |  item  
-----------------
    -   |    -
-----------------
    -   |    -      <----------
-----------------
    2   |    2

因此,在下一次迭代中,迭代器将找到值 2 并将其返回.然后添加两个,并添加一个3:

So in the next iteration the iterator finds the value 2 and returns it. Then the two is added and a 3 is added:

  hash  |  item  
-----------------
    -   |    -
-----------------
    -   |    -
-----------------
    -   |    -      <----------
-----------------
    3   |    3

以此类推.

直到达到 7 .那时发生了一些有趣的事情: 8 的截断的哈希意味着 8 将被放入第0行,但是第0行已经被迭代,因此它将以 7 .实际值可能取决于Python版本和集合的添加/删除历史记录,例如,只需更改 set.add set.discard 操作的顺序即可不同的结果(由于调整了大小,最多可以有15个结果.)

Until it reaches 7. At that point something interesting happens: The truncated hash of 8 means that the 8 will be put in row 0, however row 0 has already been iterated over so it will stop with 7. The actual value might depend on Python version and the add/removal history of the set, for example just changing the order of the set.add and set.discard operations will give a different result (goes up to 15 because the set is resized).

出于相同的原因,如果迭代器在每次迭代中添加 item-1 ,则迭代器将仅显示 1 ,因为 0 为(由于哈希值0)到第一行:

For the same reason the iterator would only display 1 if one would add the item - 1 in each iteration, because 0 would be (because of hash 0) to the first row:

s = {1}
for item in s: 
    print(item)
    s.discard(item)
    s.add(item-1)

  hash  |  item  
-----------------
    -   |    -
-----------------
    1   |    1      <----------
-----------------
    -   |    -

  hash  |  item  
-----------------
    0   |    0
-----------------
    -   |    -
-----------------
    -   |    -      <----------

通过简单的GIF可视化:

Visualized with a simple GIF:

请注意,这些示例非常简单,如果在迭代过程中调整 set 的大小,它将根据新"截断的哈希值重新分配存储的项目,并且还将删除剩余的虚拟变量从集合中删除项目时出现在后面.在那种情况下,您仍然可以(大致)预测会发生什么,但是它将变得更加复杂.

Note that these examples are very simplistic, if the set resizes during the iteration it will re-distribute the stored items based on the "new" truncated hash and it will also remove dummies that are left behind when you remove an item from a set. In that case you still can (roughly) predict what will happen but it will become a lot more convoluted.

另外一个非常重要的事实是,Python(自Python 3.4起)对每个解释器随机化了字符串的哈希值.这意味着每个Python会话都会为字符串产生不同的哈希值.因此,如果您在迭代时添加/删除字符串,则行为也是随机的.

One additional but very important fact is that Python (since Python 3.4) randomizes the hash value of strings for each interpreter. That means that each Python session will produce different hash-values for strings. So if you add/remove strings while iterating the behavior will also be random.

假设您具有以下脚本:

s = {'a'}
for item in s: 
    print(item)
    s.discard(item)
    s.add(item*2)

,然后在命令行中多次运行它,结果将有所不同.

and you run it multiple times from the command line the result will be different.

例如我的第一次跑步:

'a'
'aa'

我的第二/第三/第四次跑步:

My second / third / fourth run:

'a'

我的第五次跑步:

'a'
'aa'

这是因为命令行中的脚本始终会启动一个新的解释器会话.如果您在同一会话中多次运行脚本,结果将不会有所不同.例如:

That's because scripts from the command line always start a new interpreter session. If you run the script multiple times in the same session the results won't differ. For example:

>>> def fun():
...     s = {'a'}
...     for item in s: 
...         print(item)
...         s.discard(item)
...         s.add(item*2)

>>> for i in range(10):
...     fun()

产生:

a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa
a
aa

但是它也可以给出10倍的 a 或10倍的 a aa aaaa ,...

But it could also have given 10 times a or 10 times a, aa and aaaa, ...

总结一下:

  • 如果将项目放置在尚未迭代的位置,则将显示在迭代过程中添加到集合的值.位置取决于项目的截断的哈希值和碰撞策略.

  • A value added to a set during iteration will be displayed if the item is put in a position that hasn't been iterated over. The position depends on the truncated hash of the item and the collision strategy.

哈希的截断取决于集合的大小,而大小取决于集合的添加/删除历史记录.

The truncation of the hash depends on the size of the set and that size depends on the add/removal history of the set.

带有字符串的人无法预测位置,因为在最近的Python版本中,它们的哈希值是基于每个会话随机分配的.

With strings one cannot predict the position because their hash values are randomized on a per-session-basis in recent Python versions.

最重要的是:避免在迭代过程中修改set/list/dict/....它几乎总是会导致问题,即使不是这样,也会使任何人混淆阅读它!尽管在极少数情况下,在列表上进行迭代时将元素添加到列表是有意义的.那将需要非常具体的注释,否则看起来像是一个错误!特别是对于集合和字典,您将依赖可能随时更改的实现细节!

And most importantly: Avoid modifying the set / list / dict / ... while iterating over it. It almost always leads to problems and even if it doesn't it will confuse anyone reading it! Although there are a few, very rare cases where adding elements to a list while iterating over it makes sense. That would require very specific comments alongside it, otherwise it would look like a mistake! Especially with sets and dicts you will rely on implementation details that might change at any point!

以防万一,我用Jupyter笔记本电脑中的Cython内省功能(有点脆弱,可能只在Python 3.6上工作并且绝对不能在生产代码中使用)检查过该集的内部:

Just in case you're curious I inspected the internals of the set using (somewhat fragile and probably only works on Python 3.6 and definitely not usable in production-code) Cython introspection in a Jupyter Notebook:

%load_ext Cython

%%cython

from cpython cimport PyObject, PyTypeObject
cimport cython

cdef extern from "Python.h":
    ctypedef Py_ssize_t Py_hash_t

    struct setentry:
        PyObject *key
        Py_hash_t hash

    ctypedef struct PySetObject:
        Py_ssize_t ob_refcnt
        PyTypeObject *ob_type
        Py_ssize_t fill
        Py_ssize_t used
        Py_ssize_t mask
        setentry *table
        Py_hash_t hash
        Py_ssize_t finger

        setentry smalltable[8]
        PyObject *weakreflist

cpdef print_set_table(set inp):
    cdef PySetObject* innerset = <PySetObject *>inp
    for idx in range(innerset.mask+1):
        if (innerset.table[idx].key == NULL):
            print(idx, '<EMPTY>')
        else:
            print(idx, innerset.table[idx].hash, <object>innerset.table[idx].key)

在集内打印键值表:

a = {1}
print_set_table(a)

for idx, item in enumerate(a):
    print('\nidx', idx)
    a.discard(item)
    a.add(item+1)
    print_set_table(a)

请注意,输出将包含虚拟对象(已删除的set-item的剩余物),有时它们会消失(当set get的大小太满时).

Note that the output will contain dummys (left-overs from deleted set-items) and they will sometimes disappear (when the set get's too full or resized).

这篇关于在迭代其元素的同时更新集合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆