Python OrderedDict sputtering as compared to dict()


Problem Description


This one has me entirely baffled.

import collections

asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        #row = collections.OrderedDict([("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
        row = {"computer_name": key_host, "id": index, "hist_item": hist_item}
        asset_hist.append(row)

This code works perfectly with the collections line commented out. However, when I comment out the row = dict line and uncomment the collections line, things get very strange. There are about 4 million of these rows being generated and appended to asset_hist.

So, when I use row = dict, the entire loop finishes in about 10 milliseconds, lightning fast. When I use the ordered dictionary, I've waited over 10 minutes and it still hasn't finished. Now, I know OrderedDict is supposed to be a little slower than a dict, but it's supposed to be about 10x slower at worst, and by my math it's actually about 100,000 times slower in this function.

I decided to print the index in the innermost loop to see what was happening. Interestingly enough, I noticed a sputtering in the console output. The index would print very fast on the screen and then stop for about 3-5 seconds before continuing on.

am_output.asset_history is a dictionary keyed by host, where each value is a list of strings. E.g.

am_output.asset_history = {"host1": ["string1", "string2", ...], "host2": ["string1", "string2", ...], ...}

EDIT: Sputter Analysis with OrderedDict

Total Memory on this VM Server: Only 8GB... need to get more provisioned.

LOOP NUM

184796 (~5 second wait, ~60% memory usage)

634481 (~5 second wait, ~65% memory usage)

1197564 (~5 second wait, ~70% memory usage)

1899247 (~5 second wait, ~75% memory usage)

2777296 (~5 second wait, ~80% memory usage)

3873730 (LONG WAIT... waited 20 minutes and gave up! 88.3% memory usage, process still running)

Where the wait happens changes with each run.

EDIT: Ran it again, and this time it stopped on 3873333, close to the spot it stopped before. It stopped after forming the row, while trying to append... I didn't notice this the last attempt, but it was there then too... the problem is with the append line, not the row line... I'm still baffled. Here's the row it produced right before the long stop (I added the row to the print statement)... hostname changed to protect the innocent:

3873333: OrderedDict([('computer_name', 'bg-fd5612ea'), ('id', 1), ('hist_item', "sys1 Normalizer (sys1-4): Domain Name cannot be determined from sys1 Name 'bg-fd5612ea'.")])

Solution

As your own tests prove, you're running out of memory. Even on CPython 3.6 (where plain dict is actually ordered, though not as a language guarantee yet), OrderedDict has significant memory overhead compared to dict; it's still implemented with a side-band linked list to preserve the order and support easy iteration, reordering with move_to_end, etc. You can tell just by checking with sys.getsizeof (exact results will differ by Python version and build bitwidth, 32 vs. 64 bit):

>>> import sys
>>> from collections import OrderedDict
>>> od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
>>> d = {**od}
>>> sys.getsizeof(od)
464   # On 3.5 x64 it's 512
>>> sys.getsizeof(d)
240   # On 3.5 x64 it's 288

Ignoring the data stored, the overhead for the OrderedDict here is nearly twice that of the plain dict: 464 vs. 240 bytes, a difference of 224 bytes per item. If you're making 4 million of these items, that difference alone adds a titch over 850 MB of overhead on my machine (224 B × 4,000,000 ≈ 854 MiB, on both 3.5 and 3.6).

It's likely that the combination of all the other programs on your system, plus your Python program, is exceeding the RAM allocated to your machine, and you're stuck swap thrashing. In particular, whenever asset_hist has to expand for new entries, large parts of it (paged out for lack of use) likely need to be paged back in, and whenever a cyclic garbage collection run triggers (a full GC occurs roughly every 70,000 allocations and deallocations by default), all the OrderedDicts get paged back in to check whether they're still referenced outside of cycles. You could check whether the GC runs are the main problem by disabling cyclic GC via gc.disable(), as sketched below.
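To test that hypothesis, a minimal sketch might look like the following (it simply wraps the question's own loop; gc.disable() and gc.enable() are the standard gc module calls, and everything else is taken from the question):

import collections
import gc

gc.disable()  # suspend cyclic garbage collection while building the list
try:
    asset_hist = []
    for key_host, val_hist_list in am_output.asset_history.items():
        for index, hist_item in enumerate(val_hist_list):
            asset_hist.append(collections.OrderedDict(
                [("computer_name", key_host), ("id", index), ("hist_item", hist_item)]))
finally:
    gc.enable()  # restore normal collection once the loop is done

If the sputtering disappears with GC disabled, the pauses were collection-driven; if it persists, swap pressure is the dominant cost.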

Given your particular use case, I'd strongly recommend avoiding both dict and OrderedDict though. The overhead of even dict, even the cheaper form on Python 3.6, is kind of extreme when you have a set of exactly three fixed keys over and over. Instead, use collections.namedtuple, which is designed for lightweight objects referenceable by either name or index (they act like regular tuples, but also allow accessing each value as a named attribute), which would dramatically reduce the memory cost of your program (and likely speed it up even when memory is not an issue).

For example:

from collections import namedtuple

ComputerInfo = namedtuple('ComputerInfo', ['computer_name', 'id', 'hist_item'])

asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        asset_hist.append(ComputerInfo(key_host, index, hist_item))

The only difference in use is that you replace row['computer_name'] with row.computer_name, or, if you need all the values, you can unpack it like a regular tuple, e.g. comphost, idx, hist = row. If you temporarily need a true OrderedDict (don't store them for everything), you can call row._asdict() to get an OrderedDict with the same mapping as the namedtuple, but that's not normally needed. The memory savings are meaningful; on my system, the three-element namedtuple drops the per-item overhead to 72 bytes, less than a third that of even the 3.6 dict and less than a sixth of a 3.6 OrderedDict (and a three-element namedtuple remains 72 bytes on 3.5, where dict/OrderedDict are larger pre-3.6). It may save even more than that: tuples (and namedtuple by extension) are allocated as a single contiguous C struct, while dict and company are at least two allocations (one for the object structure, one or more for the dynamically resizable parts), each of which may pay allocator overhead and alignment costs.
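As a quick illustration of those access patterns (reusing the ComputerInfo class from above; the 72-byte figure is from a 64-bit CPython build and will vary, and _asdict() returns a plain dict rather than an OrderedDict on 3.8+):

>>> row = ComputerInfo('host1', 0, 'string1')
>>> row.computer_name              # attribute access instead of row['computer_name']
'host1'
>>> comphost, idx, hist = row      # unpacks like any regular tuple
>>> row._asdict()                  # ordered mapping view, only when you really need one
OrderedDict([('computer_name', 'host1'), ('id', 0), ('hist_item', 'string1')])
>>> sys.getsizeof(row)             # per-item overhead on a 64-bit build
72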

Either way, for your four million row scenario, using namedtuple would mean paying (beyond the cost of the values) overhead of around 275 MB total, vs. 915 (3.6) - 1100 (3.5) MB for dict and 1770 (3.6) - 1950 (3.5) MB for OrderedDict. When you're talking about an 8 GB system, shaving 1.5 GB off your overhead is a major improvement.
