Networkx never finishes calculating Betweenness centrality for 2 mil nodes

Question

I have a simple Twitter users graph with around 2 million nodes and 5 million edges. I'm trying to play around with centrality. However, the calculation takes a really long time (more than an hour). I don't consider my graph to be super large, so I'm guessing there might be something wrong with my code.

Here's my code.

%matplotlib inline
import pymongo
import networkx as nx
import time
import itertools

from multiprocessing import Pool
from pymongo import MongoClient

from sweepy.get_config import get_config

config = get_config()

MONGO_URL = config.get('MONGO_URL')
MONGO_PORT = config.get('MONGO_PORT')
MONGO_USERNAME = config.get('MONGO_USERNAME')
MONGO_PASSWORD = config.get('MONGO_PASSWORD')

client = MongoClient(MONGO_URL, int(MONGO_PORT))

db = client.tweets
db.authenticate(MONGO_USERNAME, MONGO_PASSWORD)

users = db.users
graph = nx.DiGraph()

# One directed edge from each user to every account they follow
for user in users.find():
    graph.add_node(user['id_str'])
    for friend_id in user['friends_ids']:
        if friend_id not in graph:
            graph.add_node(friend_id)
        graph.add_edge(user['id_str'], friend_id)

The data is in MongoDB. Here's a sample of the data.

{
    "_id" : ObjectId("55e1e425dd232e5962bdfbdf"),
    "id_str" : "246483486",
    ...
    "friends_ids" : [ 
         // a bunch of ids
    ]
}

I tried using the parallel betweenness centrality example to speed things up, but it's still super slow. https://networkx.github.io/documentation/latest/examples/advanced/parallel_betweenness.html

"""
Example of parallel implementation of betweenness centrality using the
multiprocessing module from Python Standard Library.

The function betweenness centrality accepts a bunch of nodes and computes
the contribution of those nodes to the betweenness centrality of the whole
network. Here we divide the network in chunks of nodes and we compute their
contribution to the betweenness centrality of the whole network.
"""
def chunks(l, n):
    """Divide a list of nodes `l` in `n` chunks"""
    l_c = iter(l)
    while True:
        x = tuple(itertools.islice(l_c, n))
        if not x:
            return
        yield x


def _betmap(G_normalized_weight_sources_tuple):
    """multiprocessing.Pool.map only accepts functions of one argument,
    so this function takes a single tuple and unpacks it when calling
    `betweenness_centrality_source`.
    """
    return nx.betweenness_centrality_source(*G_normalized_weight_sources_tuple)


def betweenness_centrality_parallel(G, processes=None):
    """Parallel betweenness centrality  function"""
    p = Pool(processes=processes)
    node_divisor = len(p._pool)*4
    node_chunks = list(chunks(G.nodes(), int(G.order()/node_divisor)))
    num_chunks = len(node_chunks)
    bt_sc = p.map(_betmap,
                  zip([G]*num_chunks,
                      [True]*num_chunks,
                      [None]*num_chunks,
                      node_chunks))
    p.close()
    p.join()

    # Reduce the partial solutions
    bt_c = bt_sc[0]
    for bt in bt_sc[1:]:
        for n in bt:
            bt_c[n] += bt[n]
    return bt_c



print("Computing betweenness centrality for:")
print(nx.info(graph))
start = time.time()
bt = betweenness_centrality_parallel(graph, 2)
print("\t\tTime: %.4F" % (time.time()-start))
print("\t\tBetweenness centrality for node 0: %.5f" % (bt[0]))

The import process from Mongodb to networkx is relatively fast, less than a minute.

Answer

TL/DR: Betweenness centrality is a very slow calculation, so you probably want to use an approximate measure by considering a subset of myk nodes, where myk is some number much less than the number of nodes in the network but large enough to be statistically meaningful (NetworkX has an option for this: betweenness_centrality(G, k=myk)).

I'm not at all surprised it's taking a long time. Betweenness centrality is a slow calculation. The algorithm used by networkx is O(VE) where V is the number of vertices and E the number of edges. In your case VE = 10^13. I expect importing the graph to take O(V+E) time, so if that is taking long enough that you can tell it's not instantaneous, then O(VE) is going to be painful.

If a reduced network with 1% of the nodes and 1% of the edges (so 20,000 nodes and 50,000 edges) would take time X, then your desired calculation would take 10000X. If X is one second, then the new calculation is close to 3 hours, which I think is incredibly optimistic (see my test below). So before you decide there's something wrong with your code, run it on some smaller networks and get an estimate of what the run time should be for your network.

A good alternative is to use an approximate measure. The standard betweenness measure considers every single pair of nodes and the paths between them. Networkx offers an alternative which uses a random sample of just k nodes and then finds shortest paths between those k nodes and all other nodes in the network. I think this should give a speedup, running in O(kE) time.

So what you'd use is

betweenness_centrality(G, k=k)

If you want to have bounds on how accurate your result is, you could do several calls with a smallish value of k, make sure that they are relatively close and then take the average result.
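
For illustration, here's a minimal sketch of that averaging idea (the helper name and the values k=1000 and repeats=5 are arbitrary choices for this example, not values from the answer):

import networkx as nx

def approximate_betweenness(G, k=1000, repeats=5, seed=0):
    """Average several k-sample approximations of betweenness centrality."""
    totals = dict.fromkeys(G, 0.0)
    for i in range(repeats):
        # Each call samples k pivot nodes; vary the seed so the samples differ.
        bt = nx.betweenness_centrality(G, k=k, seed=seed + i)
        for node, value in bt.items():
            totals[node] += value
    return {node: value / repeats for node, value in totals.items()}

If the scores from the individual runs are already close, the average should be a reasonable estimate; if they disagree wildly, increase k.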

Here's some of my quick testing of run time, with random graphs of (V,E) = (20,50); (200,500); and (2000,5000):

import time
import networkx as nx

for n in [20, 200, 2000]:
    # Mean degree ~5, so roughly (V, E) = (n, 2.5n) in expectation
    G = nx.fast_gnp_random_graph(n, 5. / n)
    current_time = time.time()
    a = nx.betweenness_centrality(G)
    print(time.time() - current_time)

>0.00247192382812
>0.133368968964
>15.5196769238

So on my computer it takes 15 seconds to handle a network that is 0.1% the size of yours. It would take about 15 million seconds to do a network the same size as yours. That's 1.5*10^7 seconds which is a little under half of pi*10^7 seconds. Since pi*10^7 seconds is an incredibly good approximation to the number of seconds in a year, this would take my computer about 6 months.
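
As a quick back-of-envelope check of that arithmetic (a hypothetical snippet, scaling the 15.5 s measurement above by the O(VE) factor):

# Extrapolate the measured (V, E) = (2000, 5000) run to (2*10^6, 5*10^6),
# assuming the exact algorithm's O(VE) running time.
t_small = 15.5                         # seconds, measured above
scale = (2e6 * 5e6) / (2000 * 5000)    # VE grows by a factor of 10^6
t_big = t_small * scale                # ~1.55 * 10^7 seconds
print(t_big / (3600 * 24 * 30))        # ~6 months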

So you'll want to run with an approximate algorithm.
