How to efficiently calculate triad census in an undirected graph in Python

Problem description

I am calculating triad census as follows for my undirected network.

import networkx as nx
G = nx.Graph()
G.add_edges_from(
    [('A', 'B'), ('A', 'C'), ('D', 'B'), ('E', 'C'), ('E', 'F'),
     ('B', 'H'), ('B', 'G'), ('B', 'F'), ('C', 'G')])

from itertools import combinations
#print(len(list(combinations(G.nodes, 3))))

triad_class = {}
for nodes in combinations(G.nodes, 3):
    n_edges = G.subgraph(nodes).number_of_edges()
    triad_class.setdefault(n_edges, []).append(nodes)
print(triad_class)

It works fine with small networks. However, now I have a bigger network with approximately 4000-8000 nodes. When I try to run my existing code with a network of 1000 nodes, it takes days to run. Is there a more efficient way of doing this?

My current network is mostly sparse, i.e. there are only a few connections among the nodes. In that case, can I leave out the unconnected nodes, do the computation first, and add the unconnected nodes to the output later?

I am also happy to get approximate answers without calculating every combination.

Example of triad census:

A triad census divides the triads (groups of 3 nodes) into the four categories shown in the figure below.

For example, consider the network below.

The triad census for the four classes is:

{3: [('A', 'B', 'C')], 
2: [('A', 'B', 'D'), ('B', 'C', 'D'), ('B', 'D', 'E')], 
1: [('A', 'B', 'E'), ('A', 'B', 'F'), ('A', 'B', 'G'), ('A', 'C', 'D'), ('A', 'C', 'E'), ('A', 'C', 'F'), ('A', 'C', 'G'), ('A', 'D', 'E'), ('A', 'F', 'G'), ('B', 'C', 'E'), ('B', 'C', 'F'), ('B', 'C', 'G'), ('B', 'D', 'F'), ('B', 'D', 'G'), ('B', 'F', 'G'), ('C', 'D', 'E'), ('C', 'F', 'G'), ('D', 'E', 'F'), ('D', 'E', 'G'), ('D', 'F', 'G'), ('E', 'F', 'G')], 
0: [('A', 'D', 'F'), ('A', 'D', 'G'), ('A', 'E', 'F'), ('A', 'E', 'G'), ('B', 'E', 'F'), ('B', 'E', 'G'), ('C', 'D', 'F'), ('C', 'D', 'G'), ('C', 'E', 'F'), ('C', 'E', 'G')]}

I am happy to provide more details if needed.

EDIT:

I was able to resolve the memory error by commenting out the line #print(len(list(combinations(G.nodes, 3)))) as suggested in the answer. However, my program is still slow and takes days to run even with a network of 1000 nodes. I am looking for a more efficient way of doing this in Python.

I am not limited to networkx and am happy to accept answers using other libraries and languages as well.

As always I am happy to provide more details as needed.

Solution

The idea is simple: Instead of working on the graph directly I use the adjacency matrix. I thought this would be more efficient, and it seems I was right.

In an adjacency matrix, a 1 indicates that there is an edge between the two nodes. For example, the first row can be read as "There is a link between A and B as well as between A and C".
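As a minimal sketch of that layout, the matrix for the 7-node example graph used in the code further down can be built by hand. The names `edges`, `nodes`, `index` and `matrix` are illustrative only, and the alphabetical row order is an assumption made here so that the first row belongs to A.

```python
# Edge list of the 7-node example graph (same edges as in the answer's code).
edges = [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]

# Assumption: rows and columns are ordered alphabetically by node label.
nodes = sorted({n for edge in edges for n in edge})
index = {n: i for i, n in enumerate(nodes)}

# Symmetric 0/1 matrix: entry [i][j] is 1 when nodes i and j share an edge.
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1

# Print a header line followed by one labelled row per node.
print("  " + " ".join(nodes))
for n, row in zip(nodes, matrix):
    print(n, " ".join(map(str, row)))
```

The row printed for A has a 1 only in the B and C columns, which is the "link between A and B as well as C" reading described above.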

From there I looked at your four types and found the following:

  • For type 3, there must be an edge between N1 and N2, between N1 and N3, and between N2 and N3. In the adjacency matrix we can find this by going over each row (each row represents a node and its connections; this is N1) and finding the nodes it is connected to (these are N2). Then, in the row of N2, we check all connected nodes (these are N3) and keep those that also have a positive entry in the row of N1. An example of this is "A, B, C": A has a connection to B, B has a connection to C, and A also has a connection to C.

  • Type 2 works almost identically to type 3, except that now we want to find a 0 in the N3 column of the N1 row. An example of this is "A, B, D": A has a connection to B, B has a 1 in the D column, but A does not.

  • For type 1, N1 and N2 are connected, so we look at the row of N2 and find all columns for which both the N1 row and the N2 row have a 0.

  • Lastly, for type 0, look at all columns in the N1 row whose entry is 0 (these are N2), then check the rows of those nodes and find all columns (N3) that have a 0 both there and in the N1 row.
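If only the number of triads per class is needed, and not the triples themselves, the same neighbour-row idea can be turned into closed-form counts. Note that this is a different technique from the enumeration in this answer: it counts triangles and two-edge paths from the neighbour sets and derives the remaining classes arithmetically. The function name `triad_counts` and the `adj` mapping (node to set of neighbours, with isolated nodes present as empty sets) are assumptions made for this sketch.

```python
from math import comb


def triad_counts(adj):
    """Return {k: number of triads with exactly k edges}.

    `adj` maps every node to the set of its neighbours (undirected, no self-loops).
    """
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    # Each triangle is found once per ordered edge (u, v), i.e. 6 times in total.
    t3 = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u]) // 6
    # A node of degree d is the centre of comb(d, 2) two-edge paths;
    # a triangle contains three of them, a type-2 triad exactly one.
    t2 = sum(comb(len(nbrs), 2) for nbrs in adj.values()) - 3 * t3
    # Pairing every edge with every other node counts a k-edge triad k times.
    t1 = m * (n - 2) - 2 * t2 - 3 * t3
    t0 = comb(n, 3) - t1 - t2 - t3
    return {0: t0, 1: t1, 2: t2, 3: t3}


edges = [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(triad_counts(adj))  # {0: 10, 1: 21, 2: 3, 3: 1}
```

For the example graph these counts agree with the lengths of the four lists in the census shown in the question. The work is driven by the edges and neighbour sets rather than by all node triples, which is why it tends to suit sparse graphs.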

This code should work for you. For 1000 nodes it took me about 7 minutes (on a machine with an i7-8565U CPU), which is still relatively slow, but a far cry from the multiple days it currently takes you to run your solution. I have included the example from your pictures so you can verify the results. By the way, the graph your code builds is different from the example you show in your pictures. The example graph in the code and the adjacency matrix both refer to the picture you have included.

The example with 1000 nodes uses networkx.generators.random_graphs.fast_gnp_random_graph. 1000 is the number of nodes, 0.1 is the probability of edge creation, and the seed is just for consistency. I set the edge-creation probability to a low value because you mentioned your graph is sparse.
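To put those parameters in perspective, a short back-of-the-envelope calculation gives the number of triples a full census has to account for and the expected edge count of a G(n, p) random graph, which is p times the number of node pairs. The variable names are illustrative.

```python
from math import comb

n, p = 1000, 0.1

# Number of distinct node triples in the benchmark graph.
print(comb(n, 3))             # 166167000

# Expected number of edges in G(n, p): p times the number of node pairs.
print(round(p * comb(n, 2)))  # 49950

# Number of triples at the upper end of the size mentioned in the question.
print(comb(8000, 3))          # 85301336000
```

Because the triple count grows roughly with the cube of the node count, going from 1000 to 8000 nodes multiplies the number of triples by about 500.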

networkx.linalg.graphmatrix.adjacency_matrix: "If you want a pure Python adjacency matrix representation try networkx.convert.to_dict_of_dicts which will return a dictionary-of-dictionaries format that can be addressed as a sparse matrix."

The dictionary structure has M dictionaries (= rows), each with up to M dictionaries nested in it. Note that the innermost dictionaries are empty, so checking whether a key exists in a row is equivalent to checking for a 1 or 0 as described above.
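The shape of that structure can be illustrated without networkx. The sketch below builds a hand-made stand-in with the same layout, {node: {neighbour: {}}}, for the example graph; the innermost dictionaries are empty because the edges carry no attributes. The variable name `m` simply mirrors the parameter of `triads` below.

```python
edges = [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]

# One outer entry per node (a "row"), one inner key per neighbour.
m = {}
for u, v in edges:
    m.setdefault(u, {})[v] = {}
    m.setdefault(v, {})[u] = {}

print(m["A"])         # {'B': {}, 'C': {}}
print("C" in m["A"])  # True  -> corresponds to a 1 in the matrix
print("D" in m["A"])  # False -> corresponds to a 0 in the matrix
```

Membership tests on a dict are constant time on average, so each "is there a 1 in this column" check stays cheap even for large graphs.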

import time

import networkx as nx


def triads(m):
    out = {0: set(), 1: set(), 2: set(), 3: set()}
    nodes = list(m.keys())
    for i, (n1, row) in enumerate(m.items()):
        print(f"--> Row {i + 1} of {len(m)} <--")
        # get all the connected nodes = existing keys
        for n2 in row.keys():
            # iterate over row of connected node
            for n3 in m[n2]:
                # n3 is also in the row of n1: all 3 nodes are connected to each other = type 3
                if n3 in row:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[3].add(t)
                # n2 is connected to n1 and n3 but not n1 to n3 = type 2
                else:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[2].add(t)
            # n1 and n2 are connected, get all nodes not connected to either = type 1
            for n3 in nodes:
                if n3 not in row and n3 not in m[n2]:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[1].add(t)
        for j, n2 in enumerate(nodes):
            if n2 not in row:
                # n2 not connected to n1
                for n3 in nodes[j+1:]:
                    if n3 not in row and n3 not in m[n2]:
                        # n3 is not connected to n1 or n2 = type 0
                        if len({n1, n2, n3}) == 3:
                            t = tuple(sorted((n1, n2, n3)))
                            out[0].add(t)
    return out


if __name__ == "__main__":
    g = nx.Graph()
    g.add_edges_from(
        [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
    )
    _m = nx.convert.to_dict_of_dicts(g)
    _out = triads(_m)
    print(_out)

    start = time.time()
    g = nx.generators.fast_gnp_random_graph(1000, 0.1, seed=42)
    _m = nx.convert.to_dict_of_dicts(g)
    _out = triads(_m)
    end = time.time() - start
    print(end)
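
Since the question mentions that approximate answers are acceptable, one further option is to estimate the census by sampling random triples instead of visiting all of them. This is not part of the code above; it is a rough sketch of plain uniform sampling, and the function name `estimate_census`, the `samples` and `seed` parameters, and the `adj` mapping (node to set of neighbours) are all assumptions for illustration.

```python
import random
from math import comb


def estimate_census(adj, samples, seed=0):
    """Estimate the number of triads per edge count from random triples."""
    rng = random.Random(seed)
    nodes = list(adj)
    hits = {0: 0, 1: 0, 2: 0, 3: 0}
    for _ in range(samples):
        # Draw three distinct nodes uniformly at random.
        a, b, c = rng.sample(nodes, 3)
        # Count how many of the three possible edges are present.
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        hits[k] += 1
    # Scale the observed fractions up to the total number of triples.
    total = comb(len(nodes), 3)
    return {k: total * h / samples for k, h in hits.items()}


edges = [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(estimate_census(adj, samples=2000))
```

The estimates for rare classes (triangles in a sparse graph, for example) are noisy unless the number of samples is large, so this only makes sense when rough proportions are enough.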
