How to find membership of vertices using GraphFrames, igraph, or networkx in PySpark


Question

My input dataframe is df:

    valx      valy 
1: 600060     09283744
2: 600131     96733110 
3: 600194     01700001

I want to create a graph that treats the two columns above as an edge list. My output should then list every vertex of the graph together with its membership.

I have tried GraphFrames in PySpark and the networkx library too, but I am not getting the desired results.

My output should look like the table below: all valx and valy values as vertices under V1, and their membership information under V2.

V1               V2
600060           1
96733110         1
01700001         3

This is what I tried:

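import networkx as nx
import pandas as pd

filelocation = r'Pathtodataframe df csv'

# Read the IDs as strings so leading zeros (e.g. 09283744) are preserved
Panda_edgelist = pd.read_csv(filelocation, dtype=str)

# from_pandas_edgelist builds an undirected nx.Graph by default
g = nx.from_pandas_edgelist(Panda_edgelist, 'valx', 'valy')
g2 = g.to_undirected()
list(g.nodes)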

Recommended answer

I'm not sure whether you are violating any rules here by asking the same question twice.

To detect communities with GraphFrames, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows the necessary transformations:

from graphframes import *

sc.setCheckpointDir("/tmp/connectedComponents")


l = [
(  '600060'  , '09283744'),
(  '600131'  , '96733110'),
(  '600194'  , '01700001')
]

columns = ['valx', 'valy']

#this is your input dataframe 
edges = spark.createDataFrame(l, columns)

# GraphFrames requires two dataframes: an edge dataframe and a vertex dataframe.
# The edge dataframe must have at least two columns, labeled src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()

# The vertex dataframe requires at least one column, labeled id.
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()

g = GraphFrame(vertices, edges)

Output:

+------+--------+ 
|   src|     dst| 
+------+--------+ 
|600060|09283744| 
|600131|96733110| 
|600194|01700001| 
+------+--------+ 
+--------+ 
|      id| 
+--------+ 
|  600060| 
|  600131| 
|  600194| 
|09283744| 
|96733110| 
|01700001| 
+--------+

You can then compute the connected components:

result = g.connectedComponents()
result.show()

Output:

+--------+------------+ 
|      id|   component| 
+--------+------------+ 
|  600060|163208757248| 
|  600131| 34359738368| 
|  600194|884763262976| 
|09283744|163208757248| 
|96733110| 34359738368| 
|01700001|884763262976| 
+--------+------------+
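
To make the idea of "membership" concrete, here is a minimal plain-Python sketch of connected-component labelling on the same three edges. It is not what GraphFrames runs internally; it is a small union-find illustration, and the function name `component_membership` and the 1-based labels are assumptions chosen to resemble the V1/V2 layout in the question.

```python
def component_membership(edges):
    """Label every vertex with a 1-based id of its connected component."""
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Merge the components of the two endpoints of every edge.
    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    # Give each root a small consecutive label, in order of first appearance.
    labels = {}
    membership = {}
    for v in parent:
        membership[v] = labels.setdefault(find(v), len(labels) + 1)
    return membership


# IDs are kept as strings so leading zeros survive.
edges = [
    ('600060', '09283744'),
    ('600131', '96733110'),
    ('600194', '01700001'),
]

membership = component_membership(edges)
for vertex, component in membership.items():
    print(vertex, component)
```

Each edge here is its own component, so the two endpoints of an edge share a label. As in the GraphFrames output, the label values themselves are arbitrary; only the grouping they express is meaningful.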

Other community detection algorithms (such as LPA, label propagation) are described in the GraphFrames user guide.
