Optimising Python dictionary access code


The question:

I've profiled my Python program to death, and there is one function that is slowing everything down. It uses Python dictionaries heavily, so I may not have used them in the best way. If I can't get it running faster, I will have to re-write it in C++, so is there anyone who can help me optimise it in Python?

I hope I've given the right sort of explanation, and that you can make some sense of my code! Thanks in advance for any help.

My code:

This is the offending function, profiled using line_profiler and kernprof. I'm running Python 2.7

I'm particularly puzzled by things like lines 363, 389 and 405, where an if statement with a comparison of two variables seems to take an inordinate amount of time.

I've considered using NumPy (as it does sparse matrices) but I don't think it's appropriate because: (1) I'm not indexing my matrix using integers (I'm using object instances); and (2) I'm not storing simple data types in the matrix (I'm storing tuples of a float and an object instance). But I'm willing to be persuaded about NumPy. If anyone knows about NumPy's sparse matrix performance vs. Python's hash tables, I'd be interested.

Sorry I haven't given a simple example that you can run, but this function is tied up in a much larger project and I couldn't work out how to set up a simple example to test it, without giving you half of my code base!

Timer unit: 3.33366e-10 s
File: routing_distances.py
Function: propagate_distances_node at line 328
Total time: 807.234 s

Line #   Hits         Time  Per Hit   % Time  Line Contents
328                                               @profile
329                                               def propagate_distances_node(self, node_a, cutoff_distance=200):
330                                                       
331                                                   # a makes sure its immediate neighbours are correctly in its distance table
332                                                   # because its immediate neighbours may change as binds/folding change
333    737753   3733642341   5060.8      0.2          for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
334    512120   2077788924   4057.2      0.1              use_neighbour_link = False
335                                                       
336    512120   2465798454   4814.9      0.1              if(node_b not in self.node_distances[node_a]): # a doesn't know distance to b
337     15857     66075687   4167.0      0.0                  use_neighbour_link = True
338                                                       else: # a does know distance to b
339    496263   2390534838   4817.1      0.1                  (node_distance_b_a, next_node) = self.node_distances[node_a][node_b]
340    496263   2058112872   4147.2      0.1                  if(node_distance_b_a > neighbour_distance_b_a): # neighbour distance is shorter
341        81       331794   4096.2      0.0                      use_neighbour_link = True
342    496182   2665644192   5372.3      0.1                  elif((None == next_node) and (float('+inf') == neighbour_distance_b_a)): # direct route that has just broken
343        75       313623   4181.6      0.0                      use_neighbour_link = True
344                                                               
345    512120   1992514932   3890.7      0.1              if(use_neighbour_link):
346     16013     78149007   4880.3      0.0                  self.node_distances[node_a][node_b] = (neighbour_distance_b_a, None)
347     16013     83489949   5213.9      0.0                  self.nodes_changed.add(node_a)
348                                                           
349                                                           ## Affinity distances update
350     16013     86020794   5371.9      0.0                  if((node_a.type == Atom.BINDING_SITE) and (node_b.type == Atom.BINDING_SITE)):
351       164      3950487  24088.3      0.0                      self.add_affinityDistance(node_a, node_b, self.chemistry.affinity(node_a.data, node_b.data))     
352                                                   
353                                                   # a sends its table to all its immediate neighbours
354    737753   3549685140   4811.5      0.1          for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
355    512120   2129343210   4157.9      0.1              node_b_changed = False
356                                               
357                                                       # b integrates a's distance table with its own
358    512120   2203821081   4303.3      0.1              node_b_chemical = node_b.chemical
359    512120   2409257898   4704.5      0.1              node_b_distances = node_b_chemical.node_distances[node_b]
360                                                       
361                                                       # For all b's routes (to c) that go to a first, update their distances
362  41756882 183992040153   4406.3      7.6              for node_c, (distance_b_c, node_after_b) in node_b_distances.iteritems(): # Think it's ok to modify items while iterating over them (just not insert/delete) (seems to work ok)
363  41244762 172425596985   4180.5      7.1                  if(node_after_b == node_a):
364                                                               
365  16673654  64255631616   3853.7      2.7                      try:
366  16673654  88781802534   5324.7      3.7                          distance_b_a_c = neighbour_distance_b_a + self.node_distances[node_a][node_c][0]
367    187083    929898684   4970.5      0.0                      except KeyError:
368    187083   1056787479   5648.8      0.0                          distance_b_a_c = float('+inf')
369                                                                   
370  16673654  69374705256   4160.7      2.9                      if(distance_b_c != distance_b_a_c): # a's distance to c has changed
371    710083   3136751361   4417.4      0.1                          node_b_distances[node_c] = (distance_b_a_c, node_a)
372    710083   2848845276   4012.0      0.1                          node_b_changed = True
373                                                                   
374                                                                   ## Affinity distances update
375    710083   3484577241   4907.3      0.1                          if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
376     99592   1591029009  15975.5      0.1                              node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
377                                                                   
378                                                               # If distance got longer, then ask b's neighbours to update
379                                                               ## TODO: document this!
380  16673654  70998570837   4258.1      2.9                      if(distance_b_a_c > distance_b_c):
381                                                                   #for (node, neighbour_distance) in node_b_chemical.neighbours[node_b].iteritems():
382   1702852   7413182064   4353.4      0.3                          for node in node_b_chemical.neighbours[node_b]:
383   1204903   5912053272   4906.7      0.2                              node.chemical.nodes_changed.add(node)
384                                                       
385                                                       # Look for routes from a to c that are quicker than ones b knows already
386  42076729 184216680432   4378.1      7.6              for node_c, (distance_a_c, node_after_a) in self.node_distances[node_a].iteritems():
387                                                           
388  41564609 171150289218   4117.7      7.1                  node_b_update = False
389  41564609 172040284089   4139.1      7.1                  if(node_c == node_b): # a-b path
390    512120   2040112548   3983.7      0.1                      pass
391  41052489 169406668962   4126.6      7.0                  elif(node_after_a == node_b): # a-b-a-b path
392  16251407  63918804600   3933.1      2.6                      pass
393  24801082 101577038778   4095.7      4.2                  elif(node_c in node_b_distances): # b can already get to c
394  24004846 103404357180   4307.6      4.3                      (distance_b_c, node_after_b) = node_b_distances[node_c]
395  24004846 102717271836   4279.0      4.2                      if(node_after_b != node_a): # b doesn't already go to a first
396   7518275  31858204500   4237.4      1.3                          distance_b_a_c = neighbour_distance_b_a + distance_a_c
397   7518275  33470022717   4451.8      1.4                          if(distance_b_a_c < distance_b_c): # quicker to go via a
398    225357    956440656   4244.1      0.0                              node_b_update = True
399                                                           else: # b can't already get to c
400    796236   3415455549   4289.5      0.1                      distance_b_a_c = neighbour_distance_b_a + distance_a_c
401    796236   3412145520   4285.3      0.1                      if(distance_b_a_c < cutoff_distance): # not too far to go
402    593352   2514800052   4238.3      0.1                          node_b_update = True
403                                                                   
404                                                           ## Affinity distances update
405  41564609 164585250189   3959.7      6.8                  if node_b_update:
406    818709   3933555120   4804.6      0.2                      node_b_distances[node_c] = (distance_b_a_c, node_a)
407    818709   4151464335   5070.7      0.2                      if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
408    104293   1704446289  16342.9      0.1                          node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
409    818709   3557529531   4345.3      0.1                      node_b_changed = True
410                                                       
411                                                       # If any of node b's rows have exceeded the cutoff distance, then remove them
412  42350234 197075504439   4653.5      8.1              for node_c, (distance_b_c, node_after_b) in node_b_distances.items(): # Can't use iteritems() here, as deleting from the dictionary
413  41838114 180297579789   4309.4      7.4                  if(distance_b_c > cutoff_distance):
414    206296    894881754   4337.9      0.0                      del node_b_distances[node_c]
415    206296    860508045   4171.2      0.0                      node_b_changed = True
416                                                               
417                                                               ## Affinity distances update
418    206296   4698692217  22776.5      0.2                      node_b_chemical.del_affinityDistance(node_b, node_c)
419                                                       
420                                                       # If we've modified node_b's distance table, tell its chemical to update accordingly
421    512120   2130466347   4160.1      0.1              if(node_b_changed):
422    217858   1201064454   5513.1      0.0                  node_b_chemical.nodes_changed.add(node_b)
423                                                   
424                                                   # Remove any neighbours that have infinite distance (have just unbound)
425                                                   ## TODO: not sure what difference it makes to do this here rather than above (after updating self.node_distances for neighbours)
426                                                   ##       but doing it above seems to break the walker's movement
427    737753   3830386968   5192.0      0.2          for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].items(): # Can't use iteritems() here, as deleting from the dictionary
428    512120   2249770068   4393.1      0.1              if(neighbour_distance_b_a > cutoff_distance):
429       150       747747   4985.0      0.0                  del self.neighbours[node_a][node_b]
430                                                           
431                                                           ## Affinity distances update
432       150      2148813  14325.4      0.0                  self.del_affinityDistance(node_a, node_b)

Explanation of my code:

This function maintains a sparse distance matrix representing the network distance (sum of edge weights on the shortest path) between nodes in a (very big) network. To work with the complete table and use the Floyd-Warshall algorithm would be very slow. (I tried this first, and it was orders of magnitude slower than the current version.) So my code uses a sparse matrix to represent a thresholded version of the full distance matrix (any paths with a distance greater than 200 units are ignored). The network topology changes over time, so this distance matrix needs updating over time. To do this, I am using a rough implementation of a distance-vector routing protocol: each node in the network knows the distance to each other node and the next node on the path. When a topology change happens, the node(s) associated with this change update their distance table(s) accordingly, and tell their immediate neighbours. The information spreads through the network by nodes sending their distance tables to their neighbours, who update their distance tables and spread them to their neighbours.

There is an object representing the distance matrix: self.node_distances. This is a dictionary mapping nodes to routing tables. A node is an object that I've defined. A routing table is a dictionary mapping nodes to tuples of (distance, next_node). Distance is the graph distance from node_a to node_b, and next_node is the neighbour of node_a that you must go to first, on the path between node_a and node_b. A next_node of None indicates that node_a and node_b are graph neighbours. For example, a sample of a distance matrix could be:

self.node_distances = { node_1 : { node_2 : (2.0, None),
                                   node_3 : (5.7, node_2),
                                   node_5 : (22.9, node_2) },
                        node_2 : { node_1 : (2.0, None),
                                   node_3 : (3.7, None),
                                   node_5 : (20.9, node_7)},
                        ...etc...
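
To make the update rule concrete, here is a minimal, hypothetical sketch of one distance-vector relaxation step over these dict-of-dict tables (the helper and all names are illustrative, not the actual project code):

def relax(node_distances, neighbours, node_a, node_b, cutoff=200):
    # Can node_b reach any node_c more cheaply by going through node_a first?
    d_ab = neighbours[node_a][node_b]            # direct edge weight between a and b
    table_a = node_distances[node_a]
    table_b = node_distances[node_b]
    for node_c, (d_ac, _next_hop) in table_a.iteritems():
        candidate = d_ab + d_ac                  # cost of the route b -> a -> c
        current = table_b.get(node_c, (float('+inf'), None))[0]
        if candidate < current and candidate < cutoff:
            table_b[node_c] = (candidate, node_a)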

Because of topology changes, two nodes that were far apart (or not connected at all) can become close. When this happens, entries are added to this matrix. Because of the thresholding, two nodes can become too far apart to care about. When this happens, entries are deleted from this matrix.

The self.neighbours matrix is similar to self.node_distances, but contains information about the direct links (edges) in the network. self.neighbours is continually being modified externally to this function, by the chemical reaction. This is where the network topology changes come from.

The actual function that I'm having problems with: propagate_distances_node() performs one step of the distance-vector routing protocol. Given a node, node_a, the function makes sure that node_a's neighbours are correctly in the distance matrix (topology changes). The function then sends node_a's routing table to all of node_a's immediate neighbours in the network. It integrates node_a's routing table with each neighbour's own routing table.

In the rest of my program, the propagate_distances_node() function is called repeatedly, until the distance matrix converges. A set, self.nodes_changed, is maintained, of the nodes that have changed their routing table since they were last updated. On every iteration of my algorithm, a random subset of these nodes are chosen and propagate_distances_node() is called on them. This means the nodes spread their routing tables asynchronously and stochastically. This algorithm converges on the true distance matrix when the set self.nodes_changed becomes empty.
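
As a rough sketch of that outer loop (the attribute and method names here are assumed from the description above, not taken from the real code base):

import random

def spread_until_converged(chemical, fraction=0.1):
    # Keep propagating random subsets of changed nodes until the set empties,
    # i.e. until the distance matrix has converged.
    while chemical.nodes_changed:
        changed = list(chemical.nodes_changed)
        subset = random.sample(changed, max(1, int(len(changed) * fraction)))
        for node in subset:
            chemical.nodes_changed.discard(node)
            chemical.propagate_distances_node(node)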

The "affinity distances" parts (add_affinityDistance and del_affinityDistance) are a cache of a (small) sub-matrix of the distance matrix, that is used by a different part of the program.

The reason I'm doing this is that I'm simulating computational analogues of chemicals participating in reactions, as part of my PhD. A "chemical" is a graph of "atoms" (nodes in the graph). Two chemicals binding together is simulated as their two graphs being joined by new edges. A chemical reaction happens (by a complicated process that isn't relevant here), changing the topology of the graph. But what happens in the reaction depends on how far apart the different atoms are that make up the chemicals. So for each atom in the simulation, I want to know which other atoms it is close to. A sparse, thresholded distance matrix is the most efficient way to store this information. Since the topology of the network changes as the reaction happens, I need to update the matrix. A distance-vector routing protocol is the fastest way I could come up with of doing this. I don't need a more complicated routing protocol, because things like routing loops don't happen in my particular application (because of how my chemicals are structured). The reason I'm doing it stochastically is so that I can interleave the chemical reaction processes with the distance spreading, and simulate a chemical gradually changing shape over time as the reaction happens (rather than changing shape instantly).

The self in this function is an object representing a chemical. The nodes in self.node_distances.keys() are the atoms that make up the chemical. The nodes in self.node_distances[node_x].keys() are nodes from the chemical and potentially nodes from any chemicals that the chemical is bound to (and reacting with).

Update:

I tried replacing every instance of node_x == node_y with node_x is node_y (as per @Sven Marnach's comment), but it slowed things down! (I wasn't expecting that!) My original profile took 807.234s to run, but with this modification it increased to 895.895s. Sorry, I was doing the profiling wrong! I was using line-by-line profiling, which (on my code) had far too much variance (that difference of ~90 seconds was all in the noise). When profiling it properly, is is definitely faster than ==. Using cProfile, my code with == took 34.394s, but with is, it took 33.535s (which I can confirm is out of the noise).
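
For reference, this kind of micro-difference can be checked in isolation with the standard timeit module (a sketch; the Node class here is a stand-in, not the project's class):

import timeit

setup = """
class Node(object):
    pass
a = Node()
b = Node()
"""
# The default __eq__ falls back to identity, but still goes through the
# rich-comparison machinery; 'is' is a single pointer comparison.
print timeit.timeit('a == b', setup=setup)
print timeit.timeit('a is b', setup=setup)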

Update: existing libraries

I'm unsure as to whether there will be an existing library that can do what I want, since my requirements are unusual: I need to compute the shortest-path lengths between all pairs of nodes in a weighted, undirected graph. I only care about path lengths that are lower than a threshold value. After computing the path lengths, I make a small change to the network topology (adding or removing an edge), and then I want to re-compute the path lengths. My graphs are huge compared to the threshold value (from a given node, most of the graph is further away than the threshold), and so the topology changes don't affect most of the shortest-path lengths. This is why I am using the routing algorithm: because this spreads topology-change information through the graph structure, so I can stop spreading it when it's gone further than the threshold. i.e., I don't need to re-compute all the paths each time. I can use the previous path information (from before the topology change) to speed up the calculation. This is why I think my algorithm will be faster than any library implementations of shortest-path algorithms. I've never seen routing algorithms used outside of actually routing packets through physical networks (but if anyone has, then I'd be interested).

NetworkX was suggested by @Thomas K. It has lots of algorithms for calculating shortest paths. It has an algorithm for computing the all-pairs shortest path lengths with a cutoff (which is what I want), but it only works on unweighted graphs (mine are weighted). Unfortunately, its algorithms for weighted graphs don't allow the use of a cutoff (which might make them slow for my graphs). And none of its algorithms appear to support the use of pre-calculated paths on a very similar network (i.e. the routing stuff).

igraph is another graph library that I know of, but looking at its documentation, I can't find anything about shortest-paths. But I might have missed it - its documentation doesn't seem very comprehensive.

NumPy might be possible, thanks to @9000's comment. I can store my sparse matrix in a NumPy array if I assign a unique integer to each instance of my nodes. I can then index a NumPy array with integers instead of node instances. I will also need two NumPy arrays: one for the distances and one for the "next_node" references. This might be faster than using Python dictionaries (I don't know yet).
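
A minimal sketch of that idea (the size bound and all names are assumptions for illustration):

import numpy as np

class Node(object):
    pass

node_index = {}                                # node instance -> integer index
def index_of(node):
    return node_index.setdefault(node, len(node_index))

N = 1000                                       # assumed upper bound on node count
dist = np.full((N, N), np.inf)                 # distances; inf == unknown / too far
nxt = np.full((N, N), -1, dtype=np.int32)      # next-hop index; -1 == direct link (None)

a, b = Node(), Node()
i, j = index_of(a), index_of(b)
dist[i, j] = dist[j, i] = 2.0                  # an undirected edge of weight 2.0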

Does anyone know of any other libraries that might be useful?

Update: memory usage

I'm running Windows (XP), so here is some info about memory usage, from Process Explorer. The CPU usage is at 50% because I have a dual-core machine.

My program doesn't run out of RAM and start hitting the swap. You can see that from the numbers, and from the IO graph not having any activity. The spikes on the IO graph are where the program prints to the screen to say how it's doing.

However, my program does keep using up more and more RAM over time, which is probably not a good thing (but it's not using up much RAM overall, which is why I didn't notice the increase until now).

And the distance between the spikes on the IO graph increases over time. This is bad - my program prints to the screen every 100,000 iterations, so that means that each iteration is taking longer to execute as time goes on... I've confirmed this by doing a long run of my program and measuring the time between print statements (the time between each 10,000 iterations of the program). This should be constant, but as you can see from the graph, it increases linearly... so something's up there. (The noise on this graph is because my program uses lots of random numbers, so the time for each iteration varies.)

After my program's been running for a long time, the memory usage looks like this (so it's definitely not running out of RAM):

The answer:

node_after_b == node_a will try to call node_after_b.__eq__(node_a):

>>> class B(object):
...     def __eq__(self, other):
...         print "B.__eq__()"
...         return False
... 
>>> class A(object):
...     def __eq__(self, other):
...         print "A.__eq__()"
...         return False
... 
>>> a = A()
>>> b = B()
>>> a == b
A.__eq__()
False
>>> b == a
B.__eq__()
False
>>> 

Try to override Node.__eq__() with an optimized version before resorting to C.
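
For example, a minimal sketch of such an override (the integer uid field is an assumption for illustration; note that Python 2 does not derive != from ==, so __ne__ needs defining too):

class Node(object):
    def __init__(self, uid):
        self.uid = uid           # assumed pre-assigned integer identity

    def __eq__(self, other):
        # Cheapest tests first: identity, then a plain int comparison.
        return self is other or (isinstance(other, Node) and self.uid == other.uid)

    def __ne__(self, other):     # Python 2 does not derive != from ==
        return not self.__eq__(other)

    def __hash__(self):
        return self.uid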

Update:

I made this little experiment (python 2.6.6):

#!/usr/bin/env python
# test.py
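# Run with something like: kernprof.py -l -v test.py  (supplies the @profile decorator)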
class A(object):
    def __init__(self, id):
        self.id = id

class B(A):
    def __eq__(self, other):
        return self.id == other.id

@profile
def main():
    list_a = []
    list_b = []
    for x in range(100000):
        list_a.append(A(x))
        list_b.append(B(x))

    ob_a = A(1)
    ob_b = B(1)
    for ob in list_a:
        if ob == ob_a:
            x = True
        if ob is ob_a:
            x = True
        if ob.id == ob_a.id:
            x = True
        if ob.id == 1:
            x = True
    for ob in list_b:
        if ob == ob_b:
            x = True
        if ob is ob_b:
            x = True
        if ob.id == ob_b.id:
            x = True
        if ob.id == 1:
            x = True

if __name__ == '__main__':
    main()

Results:

Timer unit: 1e-06 s

File: test.py
Function: main at line 10
Total time: 5.52964 s

Line #      Hits         Time  Per Hit % Time  Line Contents
==============================================================
    10                                           @profile
    11                                           def main():
    12         1            5      5.0      0.0      list_a = []
    13         1            3      3.0      0.0      list_b = []
    14    100001       360677      3.6      6.5      for x in range(100000):
    15    100000       763593      7.6     13.8          list_a.append(A(x))
    16    100000       924822      9.2     16.7          list_b.append(B(x))
    17
    18         1           14     14.0      0.0      ob_a = A(1)
    19         1            5      5.0      0.0      ob_b = B(1)
    20    100001       500454      5.0      9.1      for ob in list_a:
    21    100000       267252      2.7      4.8          if ob == ob_a:
    22                                                       x = True
    23    100000       259075      2.6      4.7          if ob is ob_a:
    24                                                       x = True
    25    100000       539683      5.4      9.8          if ob.id == ob_a.id:
    26         1            3      3.0      0.0              x = True
    27    100000       271519      2.7      4.9          if ob.id == 1:
    28         1            3      3.0      0.0              x = True
    29    100001       296736      3.0      5.4      for ob in list_b:
    30    100000       472204      4.7      8.5          if ob == ob_b:
    31         1            4      4.0      0.0              x = True
    32    100000       283165      2.8      5.1          if ob is ob_b:
    33                                                       x = True
    34    100000       298839      3.0      5.4          if ob.id == ob_b.id:
    35         1            3      3.0      0.0              x = True
    36    100000       291576      2.9      5.3          if ob.id == 1:
    37         1            3      3.0      0.0              x = True

I was surprised that:

  • Dotted (attribute) access seems to be very expensive (line 25 versus line 27); see the hoisting sketch below.
  • There is not much difference between is and '==', at least for simple objects.
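
Since attribute access showed up as the expensive part, one standard CPython trick (my illustration, not part of the original answer) is to hoist repeated attribute and method lookups out of hot loops into locals:

class Table(object):
    def __init__(self):
        self.data = {}

def count_slow(table, keys):
    # Two attribute lookups on table.data per iteration.
    for k in keys:
        table.data[k] = table.data.get(k, 0) + 1

def count_fast(table, keys):
    data = table.data            # attribute lookup hoisted: once, not len(keys) times
    get = data.get               # bound method cached as a local
    for k in keys:
        data[k] = get(k, 0) + 1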

Then I tried with more complex objects, and the results were consistent with the first experiment.

Are you swapping a lot? If your dataset is so large that it does not fit available RAM, I guess you may experience some kind of I/O contention related to virtual memory fetches.

Are you running Linux? If so, could you post a vmstat of your machine while running your program? Send us the output of something like:

vmstat 10 100

Good luck!

UPDATE (from comments by OP)

I suggested playing with sys.setcheckinterval and enabling/disabling the GC. The rationale is that for this particular case (a huge number of instances) the default GC reference-count check is somewhat expensive, and its default interval is far too frequent.

Yes, I had previously played with sys.setcheckinterval. I changed it to 1000 (from its default of 100), but it didn't make any measurable difference. Disabling garbage collection has helped - thanks. This has been the biggest speedup so far - saving about 20% (171 minutes for the whole run, down to 135 minutes) - I'm not sure what the error bars are on that, but it must be a statistically significant improvement. – Adam Nellis Feb 9 at 15:10
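
A minimal sketch of those two knobs (Python 2; run_simulation is a hypothetical stand-in for the main loop):

import gc
import sys

def run_simulation():
    pass                     # hypothetical stand-in for the main simulation loop

sys.setcheckinterval(1000)   # Python 2: check for signals/thread switches less often
gc.disable()                 # switch off the cyclic collector (refcounting still frees)
try:
    run_simulation()
finally:
    gc.enable()
    gc.collect()             # collect any cycles created while it was off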

My guess:

I think the Python GC is based on reference counting. From time to time it will check the reference count of every instance; since you are traversing these huge in-memory structures, in your particular case the default GC frequency (1000 cycles?) kicks in far too often - a huge waste. – Yours Truly Feb 10 at 2:06
