查找从叶子到森林中每个节点的所有路径 [英] Find all paths from leafs to each node in a forest

查看:40
本文介绍了查找从叶子到森林中每个节点的所有路径的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我分部分问了这个问题,因为我没有足够的信息,但现在我有,我可以问完整的问题.所以我在有 2 列的文本文件中有数据.第一列是前身,第二列是后继.我使用以下代码加载数据:

I asked this question in parts, because I didn't have enough infromations, but now that I have, I can ask the full question. So I have data in text file which has 2 columns. First column is a predecessor and second is a successor. I load the data using this code:

[line.split() for line in open('data.txt', encoding ='utf-8')]

假设我们的数据在文件中是这样的:

Lets say that our data looks like this in a file:

ANALYTICAL_BALANCE BFG_DEPOSIT 
CUSTOMER_DETAIL BALANCE 
BFG_2056 FFD_15 
BALANCE BFG_16 
BFG_16 STAT_HIST 
ANALYTICAL_BALANCE BFG_2056 
CUSTOM_DATA AND_11 
AND_11 DICT_DEAL 
DICT_DEAL BFG_2056 

加载后

[[ANALYTICAL_BALANCE,BFG_DEPOSIT],
[CUSTOMER_DETAIL,BALANCE],
[BFG_2056, FFD_15],
[BALANCE,BFG_16],
[BFG_16,STAT_HIST],
[ANALYTICAL_BALANCE,BFG_2056],
[CUSTOM_DATA,AND_11],
[AND_11,DICT_DEAL],
[DICT_DEAL,BFG_2056]]

然后我想连接这个数据.我创建了邻接表:

Then I want to connect this data. I create the adjency list:

def create_adj(edges):
    adj = {}   # or use defaultdict(list) to avoid `if` in the loop below
    for a, b in edges:
        if not a in adj:
            adj[a] = []
        if not b in adj:
            adj[b] = []
        adj[a].append(b)
    return adj

并获取所有路径:

def all_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])

    for node in adj:
        yield from recur([node])

所以输出看起来像这样:

so the output looks like this:

data = [
    ["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
    ["CUSTOMER_DETAIL","BALANCE"],
    ["BFG_2056", "FFD_15"],
    ["BALANCE","BFG_16"],
    ["BFG_16","STAT_HIST"],
    ["ANALYTICAL_BALANCE","BFG_2056"],
    ["CUSTOM_DATA","AND_11"],
    ["AND_11","DICT_DEAL"],
    ["DICT_DEAL","BFG_2056"]
]

adj = create_adj(data)

print([path for path in all_paths(adj) if len(path) > 1])


[ANALYTICAL_BALANCE,BFG_DEPOSIT]
[CUSTOMER_DETAIL,BALANCE,BFG_16,STAT_HIST]
[BFG_2056,FFD_15]
[BALANCE,BFG_16,STAT_HIST]
[ANALYTICAL_BALANCE,BFG_2056,FFD_15]
[CUSTOM_DATA,AND_11,DICT_DEAL,BFG_2056,FFD_15]
[AND_11,DICT_DEAL,BFG_2056,FFD_15]
[DICT_DEAL,BFG_2056,FFD_15]

我们可以将连接可视化为创建森林的单独树.由于输入数据的性质,树不会有任何循环.

We can visualize the connections as a separate trees which creates the forest. The trees won't have any cycles, because of nature of the input data.

现在我的问题是.如何获得每棵树从叶子到每个节点的每个连接?我的意思是什么.我们有 3 棵树,所以我将从最上面的一棵树开始.

Now my question is. How can I get every connection from leaf to every node for every tree ? What I mean by that. We have 3 trees so I will start from the top one.

Tree1:
ANALYTICAL_BALANCE BFG_DEPOSIT

Tree2:
ANALYTICAL_BALANCE BFG_2056
ANALYTICAL_BALANCE FFD_15
CUSTOM_DATA AND_11
CUSTOM_DATA DICT_DEAL
CUSTOM_DATA BFG_2056
CUSTOM_DATA FFD_15

Tree3:
CUSTOMER_DETAIL BALANCE
CUSTOMER_DETAIL BFG_16
CUSTOMER_DETAIL STAT_HIST

如您所见,我的第一次尝试是创建邻接列表并找到所有路径.然后我将删除非叶子节点之间的连接并过滤数据.输入 150 行没问题,但是当我输入包含 13k 行的完整文件时,代码编译了 2 天,没有任何结束的迹象.所以我正在为我的工作寻找最有效的代码或算法以及最好的数据类型(列表、数据帧等).任何帮助将不胜感激,因为我现在已经与它斗争了几天,我无法找到有关如何解决此问题的任何想法.如果有什么不清楚的,我会编辑帖子.
数据将使用 openpyxl 保存到 excel 文件中,因此当我按后继过滤时我可以看到前驱列中与此后继连接的每个叶子.
这是我的全部代码.

As you can see, my first try was to create list of adjacencies and find all paths. Then I would delete the connections beetwen the nodes that are not leafs and filter the data. It was fine for input of 150 rows, but when I inputted the full file with 13k rows the code was compiling for 2 days without any signs of coming to an end. So I'm looking for most efficient code or algorithm as well as best data type for my job(Lists, Data Frames etc. ). Any help would be greatly appreciated, because I'm fighting with it for a few days now and I can't find any idea on how to solve this problem. If something is not clear I will edit the post.
The data will be saved into excel file with openpyxl so when I filter by successor I can see every leaf in predecessor column that is connected to this successor.
Here is my whole code.

# create adjacencies
def create_adj(edges):
    adj = {}
    for a, b in edges:
        if not a in adj:
            adj[a] = []
        if not b in adj:
            adj[b] = []
        adj[a].append(b)
    return adj
 
# find all paths
def all_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])
 
    for node in adj:
        yield from recur([node])
 
# delete the  connections from list 
def conn_deletion(list, list_deletion):
    after_del = [x for x in list if x[0] not in list_deletion]
    return after_del
 
 # get paths where len of path is > 2 and save them as a leaf to node. Also save connections to deletion.
def unpack_paths(my_list):
    list_of_more_succ = []
    to_deletion = []
    for item in my_list:
        if len(item) == 1:
            print("len 1",item)
            to_deletion.append(item[0])
        elif len(item) > 2:
            for i in range(1, len(item) - 1):
                to_deletion.append(item[i])
                print("len > 2", item[i])
                if [item[0], item[i]] in list_of_more_succ:
                    pass
                else:
                    list_of_more_succ.append([item[0], item[i]])
    list_concat = my_list + list_of_more_succ
    sorted_list = list(k for k, _ in itertools.groupby(list_concat))
    final = conn_deletion(sorted_list, list(dict.fromkeys(to_deletion)))
    return final
 
 
data = [line.split() for line in open('data.txt', encoding='utf-8')]
 
adj = create_adj(data)
print(adj)
 
workbook = Workbook()
sheet = workbook.active
sheet["A1"] = "Source"
sheet["B1"] = "Child"
 
loaded = list(all_paths(adj))
 
final_edited = unpack_paths(loaded)
 
# Save data to excel file. We don't want paths with len == 1 or > 2.
for row, item in enumerate(final_edited, start=2):
    if len(item) > 2:
        pass
    elif len(item) == 1:
        pass
    else:
        sheet[f"A{row}"] = item[0]
        sheet[f"B{row}"] = item[1]
workbook.save("DataMap.xlsx")

推荐答案

我建议将 all_paths 更改为 leaf_paths,这意味着它只会产生那些开始于一片叶子.

I would suggest changing all_paths to leaf_paths, meaning that it would only yield those paths that start at a leaf.

然后使用这些路径:

  • 确定它指向的根(路径中的最后一个元素)
  • 识别叶子(路径中的第一个元素)
  • 迭代该路径中的所有非叶子,并将它们中的每一个与叶子组合成对.
  • 将这些对存储在以根为键的字典中

以下是您如何在标有注释的两个位置更改 all_paths:

Here is how you would alter all_paths at two places marked with a comment:

def leaf_paths(adj):
    def recur(path):
        node = path[-1]
        neighbors = [neighbor for neighbor in adj[node] if not neighbor in path]
        if not neighbors:
            yield path
        for neighbor in neighbors:
            yield from recur(path + [neighbor])

    # identify the internal nodes (not leaves)
    internals = set(parent for parents in adj.values() for parent in parents) 
    for node in adj:
        if not node in internals:  # require that the starting node is a leaf
            yield from recur([node])

然后添加这个函数:

def all_leaf_pairs(paths):
    trees = {}

    for path in paths:
        if len(path) > 1:
            root = path[-1]
            if not root in trees:
                trees[root] = []
            it = iter(path)
            leaf = next(it)
            trees[root].extend((leaf, node) for node in it)
    return trees

你的主程序会这样做:

data = [
    ["ANALYTICAL_BALANCE","BFG_DEPOSIT"],
    ["CUSTOMER_DETAIL","BALANCE"],
    ["BFG_2056", "FFD_15"],
    ["BALANCE","BFG_16"],
    ["BFG_16","STAT_HIST"],
    ["ANALYTICAL_BALANCE","BFG_2056"],
    ["CUSTOM_DATA","AND_11"],
    ["AND_11","DICT_DEAL"],
    ["DICT_DEAL","BFG_2056"]
]

adj = create_adj(data)
paths = leaf_paths(adj)

import pprint
pprint.pprint(all_leaf_pairs(paths))

这将输出:

{'BFG_DEPOSIT': [('ANALYTICAL_BALANCE', 'BFG_DEPOSIT')],
 'FFD_15': [('ANALYTICAL_BALANCE', 'BFG_2056'),
            ('ANALYTICAL_BALANCE', 'FFD_15'),
            ('CUSTOM_DATA', 'AND_11'),
            ('CUSTOM_DATA', 'DICT_DEAL'),
            ('CUSTOM_DATA', 'BFG_2056'),
            ('CUSTOM_DATA', 'FFD_15')],
 'STAT_HIST': [('CUSTOMER_DETAIL', 'BALANCE'),
               ('CUSTOMER_DETAIL', 'BFG_16'),
               ('CUSTOMER_DETAIL', 'STAT_HIST')]}

leaf_paths的说明

这个函数使用递归.它在其范围内定义了 recur.

但主要代码从识别内部节点(即至少有一个子节点的节点)开始.由于 adj 提供给定节点的父节点,我们只需要收集所有这些父节点.

But the main code starts with identifying the internal nodes (i.e. the nodes that have at least one child). Since adj provides the parent(s) for a given node, we just have to collect all those parents.

我们使用这组内部节点来确保我们只在叶节点上开始递归,因为在输出中我们希望路径总是以叶节点开始.

We use this set of internal nodes to make sure we start the recursion only on leaf nodes, as in the output we want to have paths that always start out with a leaf node.

recur 函数将从给定的叶子走向它可以找到的任何根.它使用它可以找到的下一个父级(neighbor)扩展路径并执行递归,直到没有更多的父级(即,它是一个根).当这种情况发生时,累积的路径(以叶子开始并以根结束)yield.

The recur function will walk from the given leaf towards any root it can find. It extends the path with the next parent it can find (neighbor) and performs recursion until there is no more parent (i.e., it is a root). When that happens the accumulated path (that starts with a leaf and ends with a root) is yielded.

leaf_paths 本身产生任何 recur 产生的路径.

leaf_paths itself yields any path that recur yields.

这篇关于查找从叶子到森林中每个节点的所有路径的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆