NetworkX 有哪些可扩展性问题? [英] What scalability issues are associated with NetworkX?

查看:23
本文介绍了NetworkX 有哪些可扩展性问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对具有数百万个节点和数千万条边的大型网络的网络分析感兴趣.我希望能够执行诸如从多种格式解析网络、查找连接组件、检测社区以及运行诸如 PageRank 之类的中心性度量之类的操作.

I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.

我被 NetworkX 吸引是因为它有一个很好的 API、很好的文档,并且多年来一直在积极开发中.另外,因为它是用python编写的,所以开发起来应该很快.

I am attracted to NetworkX because it has a nice api, good documentation, and has been under active development for years. Plus because it is in python, it should be quick to develop with.

在最近的一次演示中(可在 github 此处 上找到幻灯片),声称:

In a recent presentation (the slides are available on github here), it was claimed that:

与许多其他工具不同,NX 旨在大规模处理数据与现代问题相关……NX 中的大多数核心算法都依赖于极快的遗留代码.

Unlike many other tools, NX is designed to handle data on a scale relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.

演示文稿还指出,NetworkX 的基本算法是在 C/Fortran 中实现的.

The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.

不过看源码,貌似NetworkX大多是用python写的.我对源代码不太熟悉,但我知道几个例子,其中 NetworkX 使用 numpy 来做繁重的工作(反过来使用 C/Fortran 来做线性代数).例如,文件 networkx/networkx/algorithms/centrality/eigenvector.py 使用 numpy 来计算特征向量.

However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py uses numpy to calculate eigenvectors.

有谁知道这种调用像 numpy 这样的优化库的策略是否真的在整个 NetworkX 中普遍存在,或者是否只有少数算法这样做?任何人都可以描述与 NetworkX 相关的其他可扩展性问题吗?

Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?

来自 NetworkX 首席程序员的回复我在 NetworkX 邮件列表上提出了这个问题,Aric Hagberg 回复了:

Reply from NetworkX Lead Programmer I posed this question on the NetworkX mailing list, and Aric Hagberg replied:

NetworkX 中使用的数据结构适用于扩展到大问题(例如数据结构是一个邻接表).这算法具有各种缩放属性,但其中一些您提及是可用的(例如 PageRank、连接的组件、是线性的边数的复杂度).

The data structures used in NetworkX are appropriate for scaling to large problems (e.g. the data structure is an adjacency list). The algorithms have various scaling properties but some of the ones you mention are usable (e.g. PageRank, connected components, are linear complexity in the number of edges).

此时 NetworkX 是纯 Python 代码.邻接结构使用 Python 字典编码,提供了极大的灵活性以牺牲内存和计算速度为代价.大图将占用大量内存,最终会耗尽.

At this point NetworkX is pure Python code. The adjacency structure is encoded with Python dictionaries which provides great flexibility at the expense of memory and computational speed. Large graphs will take a lot of memory and you will eventually run out.

NetworkX 确实将 NumPy 和 SciPy 用于主要用于基于线性代数.在这种情况下,图形表示(复制)作为使用 NumPy 矩阵或 SciPy 的邻接矩阵稀疏矩阵.这些算法可以从遗留的 C 和在 NumPy 和 SciPY 中使用的 FORTRAN 代码.

NetworkX does use NumPy and SciPy for algorithms that are primarily based on linear algebra. In that case the graph is represented (copied) as an adjacency matrix using either NumPy matrices or SciPy sparse matrices. Those algorithms can benefit from the legacy C and FORTRAN code that is used under the hood in NumPy and SciPY.

推荐答案

您的大问题将是内存.Python 只是无法处理数以千万计的对象,而无需跳过类实现中的障碍.很多对象的内存开销太高,你打到了 2GB,32 位代码就不行了.有很多方法可以解决它 - 使用插槽、数组或 NumPy.应该没问题,因为networkx是为了性能而写的,但是如果有一些东西不起作用,我会检查你的内存使用情况.

Your big issue will be memory. Python simply cannot handle tens of millions of objects without jumping through hoops in your class implementation. The memory overhead of many objects is too high, you hit 2GB, and 32-bit code won't work. There are ways around it - using slots, arrays, or NumPy. It should be OK because networkx was written for performance, but if there are a few things that don't work, I will check your memory usage.

至于缩放,算法基本上是唯一与图有关的东西.如果图算法做错了,它们往往会非常产生丑陋的缩放比例,而且它们在 Python 中与任何其他语言一样有可能正确完成.

As for scaling, algorithms are basically the only thing that matters with graphs. Graph algorithms tend to have really ugly scaling if they are done wrong, and they are just as likely to be done right in Python as any other language.

这篇关于NetworkX 有哪些可扩展性问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆