What are the differences between community detection algorithms in igraph?

Problem description

I have a list of about 100 igraph objects with a typical object having about 700 vertices and 3500 edges.

I would like to identify groups of vertices within which ties are more likely. My plan is to then use a mixed model to predict how many within-group ties vertices have using vertex and group attributes.

Some people may want to respond to other aspects of my project, which would be great, but the thing I'm most interested in is information about functions in igraph for grouping vertices. I've come across these community detection algorithms but I'm not sure of their advantages and disadvantages, or whether some other function would be better for my case. I saw the links here as well, but they aren't specific to igraph. Thanks for your advice.

Solution

Here is a short summary about the community detection algorithms currently implemented in igraph (a short R sketch showing how they are typically called follows the list):

  • edge.betweenness.community is a hierarchical decomposition process where edges are removed in the decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths, simply because in many cases they are the only option to go from one group to another. This method yields good results but is very slow because of the computational complexity of edge betweenness calculations and because the betweenness scores have to be re-calculated after every edge removal. Your graphs, with ~700 vertices and ~3500 edges, are around the upper size limit of what can feasibly be analyzed with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut it to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).

  • fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.

  • walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3-4-5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).

  • spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c as high as 200, which is likely to be enough for your purposes. 2) There may be fewer than c communities in the end, as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconnected) parts of the network have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about that. The method is not particularly fast and not deterministic (because of the simulation itself), but it has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).

  • leading.eigenvector.community is a top-down hierarchical approach that again optimizes the modularity function. In each step, the graph is split into two parts in such a way that the separation itself yields a significant increase in modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.

  • label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively, re-assigning labels so that each node takes the most frequent label among its neighbors, with all nodes updated synchronously. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results depending on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which can be tedious.
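
If it helps, here is a minimal sketch of how the methods above are typically invoked from R. It assumes igraph 0.6 or later, where these functions return a communities object with membership(), modularity() and sizes() accessors (newer releases expose the same algorithms under names like cluster_fast_greedy()); the parameter values shown (steps, spins, number of runs) are purely illustrative:

    library(igraph)

    g <- graph.famous("Zachary")                  # small built-in test graph

    eb <- edge.betweenness.community(g)           # top-down hierarchical; slow on large graphs
    fg <- fastgreedy.community(g)                 # greedy modularity optimization
    wt <- walktrap.community(g, steps = 4)        # random-walk based; 'steps' is the walk length
    sg <- spinglass.community(g, spins = 25)      # Potts model; 'spins' caps the number of communities
    le <- leading.eigenvector.community(g)        # spectral splits of the modularity matrix
    lp <- label.propagation.community(g)          # very fast, but non-deterministic

    membership(fg)    # community id of every vertex
    modularity(fg)    # modularity of the detected partition
    sizes(wt)         # number of vertices in each community

    # label propagation depends on its random initialization; one simple
    # (if crude) alternative to a full consensus labeling is to keep the
    # run with the highest modularity
    runs <- lapply(1:100, function(i) label.propagation.community(g))
    best <- runs[[which.max(sapply(runs, modularity))]]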

igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.
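
Once a version with Infomap is available, the call should look the same as for the other methods; a hedged sketch, assuming the infomap.community() name introduced in igraph 0.6:

    im <- infomap.community(g)    # requires igraph >= 0.6
    membership(im)
    modularity(im)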

Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate the other methods if these two turn out to be unsuitable for a particular problem for some reason.
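
A minimal sketch of that workflow, assuming your graphs live in a list called graphs (a hypothetical name) and that you simply pick, per graph, whichever of the two partitions has the higher modularity; whether modularity is the right selection criterion for your mixed-model analysis is of course your call:

    # fastgreedy.community needs an undirected graph without multi-edges,
    # so clean up first if necessary (see as.undirected() and simplify())
    pick_partition <- function(g) {
        fg <- fastgreedy.community(g)
        wt <- walktrap.community(g)
        if (modularity(fg) >= modularity(wt)) fg else wt
    }

    partitions  <- lapply(graphs, pick_partition)   # one communities object per graph
    group_sizes <- lapply(partitions, sizes)        # community sizes, e.g. as group-level predictors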
