Optimization of an all-paths algorithm


Problem Description


I've been successfully using the following algorithm to generate all-paths data up to a path length of 10 on graphs of ~900 nodes. However, I want to scale it up to larger graphs, and I'm wondering whether there are further optimizations I can make. So far I have:

  • After a node has completed its DFS, its paths are saved to a hash table. Should that node be encountered again, its paths from the hash table are appended, so work is not repeated.
  • Nodes are sorted by degree, highest first. This way the nodes most likely to be encountered will already be in the hash table (a cleaner way to build that ordering is sketched after this list).
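
As an aside, the degree-sorted node list can be built directly from the (node, degree) pairs, without the regex parsing used in the code below. A minimal sketch, assuming DG is a networkx 1.x DiGraph (degree_iter() was removed in networkx 2.x):

from operator import itemgetter

# degree_iter() yields (node, degree) pairs in networkx 1.x;
# sort by degree, highest first, then keep only the node names
degreelist = sorted(DG.degree_iter(), key=itemgetter(1), reverse=True)
nlist = [node for node, degree in degreelist]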

Specifics of the algorithm: it builds a network of chemicals (nodes) and reactions (edges) and creates paths, so that the network can later be searched much faster for theoretical paths; those paths can then be tested experimentally.
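
For context, a toy stand-in for such a network might look like the following (the chemical names are purely illustrative, not from the original question):

import networkx as nx

# chemicals are nodes, reactions are directed edges
DG = nx.DiGraph()
DG.add_edge('glucose', 'pyruvate')      # a reaction converting one chemical to another
DG.add_edge('pyruvate', 'lactate')
DG.add_edge('glucose', 'glucose-6-P')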

The algorithm is based on the networkx all_simple_paths algorithm:

# Python 2 / networkx 1.x code: print statements and degree_iter()
import csv
import re
from operator import itemgetter

def _all_simple_paths_graph(DG, cutoff):
    memorizedPaths = {}
    nlist = []
    # sort nodes into highest -> lowest degree order
    degreelist = sorted(DG.degree_iter(), key=itemgetter(1), reverse=True)
    for i in degreelist:
        # pull the node name out of the (node, degree) tuple's repr
        t = re.findall(r"'\s*([^\"]*)\s*'", str(i))
        nlist.extend(t)
    with open('PythonOutput.txt', "wb") as csvfile:
        writer = csv.writer(csvfile, delimiter=' ', quotechar='|')
        numberdone = 0
        # for each node start a new DFS
        for source in nlist:
            print source
            print numberdone
            numberdone += 1
            uniqueTreePaths = []
            if cutoff < 1:
                return
            visited = [source]
            stack = [iter(DG[source])]
            while stack:
                children = stack[-1]
                child = next(children, None)
                if child is None:
                    stack.pop()
                    visited.pop()
                # if a node has been searched before, append its memorized paths
                elif child in memorizedPaths:
                    for path in memorizedPaths[child]:
                        newPath = tuple(visited) + tuple(path)
                        # keep the path if it fits the cutoff and shares no nodes with the prefix
                        if (len(newPath) <= cutoff) and (len(set(visited) & set(path)) == 0):
                            uniqueTreePaths.append(newPath)
                    continue
                elif len(visited) < cutoff:
                    if child not in visited:
                        visited.append(child)
                        stack.append(iter(DG[child]))
                        if visited not in uniqueTreePaths:
                            uniqueTreePaths.append(tuple(visited))
                else:  # len(visited) == cutoff
                    if (visited not in uniqueTreePaths) and (child not in visited):
                        uniqueTreePaths.append(tuple(visited + [child]))
                    stack.pop()
                    visited.pop()
            # for each node, write its paths to disk to save RAM
            for path in uniqueTreePaths:
                writer.writerow(path)
            # add this node's paths to the hash table
            memorizedPaths[source] = uniqueTreePaths

If anyone has any suggestions for further optimizing the algorithm, it would be greatly appreciated.

Solution

First of all, measurement is your best friend. If you are not collecting information about how long the algorithm takes, you have no real way of knowing whether a change helps. Your idea of caching results is clever, but you should check timings with and without the caching to make sure it actually helps.
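
A minimal timing sketch, using only the standard time module (DG and the function name are taken from the question's code; the cutoff of 10 matches the path length mentioned above):

import time

start = time.time()
_all_simple_paths_graph(DG, 10)
print("run took %.2f seconds" % (time.time() - start))

Running the same measurement with the memoization branch disabled would show whether the cache pays for its cost.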

One part of your code in particular where I can see room for improvement is if (visited not in uniqueTreePaths).... You are checking whether a list is contained in a list of lists, which is a linear scan over every stored path. I'm not sure what the best way to fix this would be (again, collect timing data from your code), but one possibility would be to represent the paths as strings instead of lists, allowing them to be stored in a hash-based structure.
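
A minimal sketch of that idea, storing paths as hashable tuples in a set (the variable values here are illustrative):

uniqueTreePaths = set()

visited = ['A', 'B', 'C']           # a path under construction (a list)
path = tuple(visited)               # tuples are hashable; lists are not
if path not in uniqueTreePaths:     # average O(1) hash lookup, not a linear scan
    uniqueTreePaths.add(path)

A set also makes uniqueness automatic, at the cost of insertion order; if the output file's ordering matters, the paths can be sorted before writing.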
