NetworkX有哪些可扩展性问题? [英] What scalability issues are associated with NetworkX?

查看:186
本文介绍了NetworkX有哪些可扩展性问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对有数百万个节点和数千万个边缘的大型网络的网络分析感兴趣。我希望能够做许多事情,比如解析来自多种格式的网络,查找连接的组件,检测社区,以及运行诸如PageRank等中心性措施。



我被NetworkX吸引,因为它有一个很好的api,很好的文档,并且多年来一直在积极开发中。另外,因为它是用python编写的,所以它应该很快就可以开发出来。



在最近的演示文稿中(幻灯片在github上可用 here ),它声称:


其他工具,NX被设计用于处理与现代问题相关的数据
...... NX中的大多数核心算法依赖于速度极快的遗留代码。


演示文稿还指出,NetworkX的基本算法是在C / Fortran中实现的。然而,从源代码看,它看起来像NetworkX主要是用python编写的。我对源代码并不太熟悉,但我知道有几个例子,其中NetworkX使用numpy来完成繁重的任务(反过来使用C / Fortran来完成线性代数)。例如,文件 networkx / networkx / algorithms / centrality / eigenvector.py 使用numpy来计算特征向量。

有没有人知道这种调用像numpy这样的优化库的策略在整个NetworkX中是非常普遍的,还是只有几个算法可以实现呢?还有谁可以描述与NetworkX相关的其他可扩展性问题?



来自NetworkX Lead Programmer的回复
我在NetworkX邮件中提出了这个问题列表中,并且Aric Hagberg回答说:


NetworkX中使用的数据结构适用于扩展到
大问题(例如数据结构是一个邻接表)。
算法具有不同的缩放属性,但您提到的某些元素是可用的(例如,PageRank,连接的元素,边数为线性
复杂度)。



此时NetworkX是纯Python代码。邻接结构
使用Python字典进行编码,以牺牲内存和计算速度为代价提供了很大的灵活性
。大图可能会使
占用很多内存,并且最终会耗尽。



NetworkX确实使用NumPy和SciPy来计算主要基于
的算法关于线性代数。在这种情况下,使用NumPy矩阵或SciPy
稀疏矩阵将图形表示为
(复制)为邻接矩阵。这些算法可以从传统的C和
FORTRAN代码中受益,这些代码在NumPy和SciPY中使用。


解决方案

你的大问题将会是记忆。 Python简单地不能处理数以千万计的对象,而不会在你的类实现中跳过循环。许多对象的内存开销太高,达到2GB,32位代码无法工作。有办法解决它 - 使用插槽,阵列或numpy。它应该是可以的,因为networkx是为了性能而编写的,但是如果有几件事情不起作用,我会检查你的内存使用情况。



至于缩放,算法基本上是唯一与图形有关的事情。如果图形算法做得不对,那么图形算法往往具有很大的缩放比例,并且它们可能会像其他任何语言一样在Python中完成。

I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.

I am attracted to NetworkX because it has a nice api, good documentation, and has been under active development for years. Plus because it is in python, it should be quick to develop with.

In a recent presentation (the slides are available on github here), it was claimed that:

Unlike many other tools, NX is designed to handle data on a scale relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.

The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.

However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py uses numpy to calculate eigenvectors.

Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?

Reply from NetworkX Lead Programmer I posed this question on the NetworkX mailing list, and Aric Hagberg replied:

The data structures used in NetworkX are appropriate for scaling to large problems (e.g. the data structure is an adjacency list). The algorithms have various scaling properties but some of the ones you mention are usable (e.g. PageRank, connected components, are linear complexity in the number of edges).

At this point NetworkX is pure Python code. The adjacency structure is encoded with Python dictionaries which provides great flexibility at the expense of memory and computational speed. Large graphs will take a lot of memory and you will eventually run out.

NetworkX does use NumPy and SciPy for algorithms that are primarily based on linear algebra. In that case the graph is represented (copied) as an adjacency matrix using either NumPy matrices or SciPy sparse matrices. Those algorithms can benefit from the legacy C and FORTRAN code that is used under the hood in NumPy and SciPY.

解决方案

Your big issue will be memory. Python simply cannot handle tens of millions of objects, without jumping through hoops in your class implementation. The memory overhead of many objects is too high, you hit 2GB, and 32 bit code won't work. There's ways around it - using slots, arrays, or numpy. It should be OK, because networkx was written for performance, but if there's a few things that just don't work I'd check your memory usage.

As for scaling, algorithms are basically the only thing that matter with graphs. Graph algorithms tend to have really ugly scaling if they are done wrong, and they are just as likely to be done right in Python as any other language.

这篇关于NetworkX有哪些可扩展性问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆