Extract edges and communities from a list of nodes


Problem description

I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I tried graph tools such as Gephi, Cytoscape, socnet, and NodeXL to visualize and identify the edges and communities, but the node list is too large for those tools. Hence I am trying to write a script to extract the edges and communities. The other columns are the connection start datetime and end datetime, with GPS locations.

Input:

Id,starttime,endtime,gps1,gps2
0022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00904b14b494,1073260804,1073265163,817558,439525
00904b14b494,1073260804,1073263786,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d1406df,1073260807,1073260878,820428,438735
00022d623dfe,1073260810,1073276346,819251,440006
00022d7317d7,1073260810,1073276155,819251,440006
00022d9064bc,1073260810,1073272525,819251,440006
00022d9064bc,1073260810,1073260999,819251,440006
00022d9064bc,1073260810,1073260857,819251,440006
0030650c9eda,1073260811,1073260813,820356,439224
00022d0e0cec,1073260813,1073262843,820187,439271
00022d176cf3,1073260813,1073260962,817721,439564
000c30d8d2e8,1073260813,1073260902,817721,439564
00904b243bc4,1073260813,1073260962,817721,439564
00904b2fc34d,1073260813,1073260962,817721,439564
00904b52b839,1073260813,1073260962,817721,439564
00904b9a5a51,1073260813,1073260962,817721,439564
00904ba8b682,1073260813,1073260962,817721,439564
00022d3be9cd,1073260815,1073261114,819269,439403
00022d80381f,1073260815,1073261114,819269,439403
00022dc1b09c,1073260815,1073261114,819269,439403
00022d36a6df,1073260817,1073260836,820761,438607
00022d36a6df,1073260817,1073260845,820761,438607
003065d2d8b6,1073260817,1073267560,817735,439757
00904b0c7856,1073260817,1073265149,817735,439757
00022de73863,1073260825,1073260879,817558,439525
00904b14b494,1073260825,1073260879,817558,439525
00904b312d9e,1073260825,1073260879,817558,439525
00022d15b1c7,1073260826,1073260966,820353,439280
00022dcbe817,1073260826,1073260966,820353,439280

I am trying to build an undirected graph, either weighted or unweighted.

Recommended answer

Use Pandas to get the data into a pairwise node listing, where each row represents an edge based on your edge criteria. Then move the result into a networkx object for graph analysis.
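As a starting point, here is a minimal sketch of loading the input. It assumes the data is CSV with the header shown in the question, and it renames the columns to ID, start, and end, which are the names the rest of this answer uses. The in-memory buffer stands in for a real file path, and only a few sample rows are included.

```python
import io

import pandas as pd

# A few rows copied from the question's input; in practice this would be a file path.
raw = io.StringIO(
    "Id,starttime,endtime,gps1,gps2\n"
    "00022d9064bc,1073260803,1073260810,819213,439954\n"
    "00904b4557d3,1073260803,1073261920,817526,439458\n"
    "00022de73863,1073260804,1073265410,817558,439525\n"
)

# Read IDs as strings so leading zeros survive, then rename the columns
# to the names used in the rest of the answer (an assumption of this sketch).
df = pd.read_csv(raw, dtype={"Id": str})
df = df.rename(columns={"Id": "ID", "starttime": "start", "endtime": "end"})
```

Reading the ID column as a string is a precaution: the identifiers start with zeros, and type inference could otherwise alter any ID that happens to look numeric.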

The criteria for two nodes sharing an edge include:

  1. Same location: I assume this means the same gps1 AND gps2.
  2. "Near same start and end time": this is a little ambiguous. For the purposes of this answer I've reduced this criterion to "start time in the same 5-second interval". It shouldn't be too hard to extend the groupby approach taken here if you want to apply additional temporal conditions on edges.

Since we want to manipulate the data based on timestamps, convert start and end to a datetime dtype:

# df holds the input, with columns named ID, start, end, gps1, gps2
df["start"] = pd.to_datetime(df["start"], unit="s")
df["end"] = pd.to_datetime(df["end"], unit="s")

df.start.describe()
count                      35
unique                     11
top       2004-01-05 00:00:13
freq                        8
first     2004-01-05 00:00:01
last      2004-01-05 00:00:26
Name: start, dtype: object

df.head()
             ID               start                 end    gps1    gps2
0   0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03  819251  440006
1  00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10  819213  439954
2  00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40  817526  439458
3  00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50  817558  439525
4  00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25  817558  439525

The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to only a few seconds:

near = "5s" 
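To make the 5-second criterion concrete, the sketch below bins a few start times from the sample input. It uses `Series.dt.ceil`, which is my stand-in for illustration rather than the grouping call used in this answer; it labels each timestamp with the right edge of its 5-second interval.

```python
import pandas as pd

# Four start times taken from the sample input (epoch seconds).
starts = pd.Series(pd.to_datetime(
    [1073260803, 1073260804, 1073260807, 1073260810], unit="s"))

# Round each timestamp up to the next 5-second boundary.
bins = starts.dt.ceil("5s")

# Timestamps sharing a bin label count as "near" each other.
same_bin = bins.iloc[0] == bins.iloc[1]   # 00:00:03 and 00:00:04
```

Fixed bins are simple but coarse: two start times one second apart can still land in different bins if a boundary falls between them.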

Now group by location and start time to find connected nodes:

edges = (df.groupby(["gps1",
                     "gps2",
                     pd.Grouper(key="start", 
                                freq=near, 
                                closed="right", 
                                label="right")], 
                   as_index=False)
           .agg({"ID":','.join,
                 "start":"min",
                 "end":"max"})
            .reset_index()
            .rename(columns={"index":"edge",
                             "start":"start_min", 
                             "end":"end_max"})
        )

edges.ID = edges.ID.str.split(",")

edges.head():

   edge    gps1    gps2                                                 ID  \
0     0  817526  439458                                     [00904b4557d3]   
1     1  817558  439525  [00022de73863, 00904b14b494, 00904b14b494, 009...   
2     2  817558  439525         [00022de73863, 00904b14b494, 00904b312d9e]   
3     3  817721  439564  [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...   
4     4  817735  439757                       [003065d2d8b6, 00904b0c7856]   

            start_min             end_max  
0 2004-01-05 00:00:03 2004-01-05 00:18:40  
1 2004-01-05 00:00:04 2004-01-05 01:16:50  
2 2004-01-05 00:00:25 2004-01-05 00:01:19  
3 2004-01-05 00:00:13 2004-01-05 00:02:42  
4 2004-01-05 00:00:17 2004-01-05 01:52:40 

Each row now represents a unique edge category. ID is a list of all the nodes that share that edge. It's a bit tricky to get this list into a new structure of node pairs; I've resorted to some old-fashioned nested for-loops. There's likely some Pandas-fu that can improve efficiency here.

Note: In the case of a singleton node, I've assigned a None value as its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.

from itertools import combinations

pairs = []
for e in edges.edge.values:
    row = edges.loc[edges.edge == e]
    nodes = row["ID"].values[0]
    # attributes shared by every pair in this edge category
    attrs = list(row[["gps1", "gps2", "start_min", "end_max"]].values[0])
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton: record the node with no partner
        pairs.append([e, nodes[0], None] + attrs)
    else:
        for combo in combos:
            pairs.append([e, combo[0], combo[1]] + attrs)
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)

pairs_df.head():

   edge         nodeA         nodeB    gps1    gps2           start_min  \
0     0  00904b4557d3          None  817526  439458 2004-01-05 00:00:03   
1     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
2     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
3     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
4     1  00904b14b494  00904b14b494  817558  439525 2004-01-05 00:00:04   

              end_max  
0 2004-01-05 00:18:40  
1 2004-01-05 01:16:50  
2 2004-01-05 01:16:50  
3 2004-01-05 01:16:50  
4 2004-01-05 01:16:50      
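One possible vectorized alternative to the nested loops is sketched below. It is not the method used in this answer: it explodes each ID list into one row per node and self-joins on the edge label. The `edges_demo` frame and its values are made up for illustration. Unlike the loop version, this sketch drops singletons and repeated nodes within a group, so it produces no self-pairs.

```python
import pandas as pd

# Toy stand-in for the grouped frame (values are illustrative only).
edges_demo = pd.DataFrame({
    "edge": [0, 1, 2],
    "ID": [["a"], ["b", "c", "c"], ["d", "e", "f"]],
})

# One row per (edge, node), with repeated nodes in a group removed.
members = edges_demo.explode("ID").drop_duplicates()

# Self-join on the edge label, then keep each unordered pair once.
pairs_demo = members.merge(members, on="edge", suffixes=("_A", "_B"))
pairs_demo = pairs_demo[pairs_demo["ID_A"] < pairs_demo["ID_B"]]
pairs_demo = (pairs_demo
              .rename(columns={"ID_A": "nodeA", "ID_B": "nodeB"})
              .reset_index(drop=True))
```

The self-join builds every ordered pair before filtering, so a very large group can still be expensive; for big groups it may be worth capping group size first.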

Now the data can be loaded into a networkx object:

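import networkx as nx

# Newer networkx releases renamed from_pandas_dataframe to from_pandas_edgelist.
# Singleton rows (nodeB is None) are left out of the edge list because
# recent networkx versions refuse None as a node.
g = nx.from_pandas_edgelist(pairs_df.dropna(subset=["nodeB"]),
                            "nodeA", "nodeB", edge_attr=True)
g.add_nodes_from(pairs_df["nodeA"])  # keep singletons as isolated nodes

# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')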
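For the weighted variant the question mentions, one option is to use the number of times a pair co-occurs as the edge weight. This is a sketch under my own assumptions: the `pairs_demo` frame is toy data, and "weight" as a co-occurrence count is just one reasonable definition.

```python
import networkx as nx
import pandas as pd

# Toy pair listing (illustrative); the None row is a singleton.
pairs_demo = pd.DataFrame({
    "nodeA": ["a", "a", "b", "c"],
    "nodeB": ["b", "b", "c", None],
})

# Drop singleton rows, then count how often each pair appears.
weighted = (pairs_demo.dropna(subset=["nodeB"])
            .groupby(["nodeA", "nodeB"])
            .size()
            .reset_index(name="weight"))

# Build an undirected graph carrying the count as the "weight" attribute.
g_w = nx.from_pandas_edgelist(weighted, "nodeA", "nodeB", edge_attr="weight")
```

This counts (a, b) and (b, a) separately, so if both orderings can occur in your pair listing, put each pair into a consistent order before grouping.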

For community detection, there are several options. Consider the networkx community algorithms, as well as the community module (http://perso.crans.org/aynaud/communities/), which builds on native networkx functionality.

I read your question as mainly concerned with getting your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies. Several methods can be used out of the box with the modules I've linked to here.
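As one illustration of an out-of-the-box method, the sketch below runs networkx's `greedy_modularity_communities` on a toy graph of two triangles joined by a single edge. The graph and the choice of algorithm are assumptions made for this example; other algorithms may suit the real data better.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two triangles linked by one bridge edge (c, d).
g_demo = nx.Graph()
g_demo.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
])

# Each community is returned as a frozenset of node labels.
communities = list(greedy_modularity_communities(g_demo))

# Map every node to the index of its community for later lookups.
membership = {node: i for i, comm in enumerate(communities) for node in comm}
```

On the graph built from the real data, `membership` could be joined back to the node table to inspect each community's locations and time ranges.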
