How to create edge list from spark data frame in Pyspark?

Problem description

I am using graphframes in pyspark for some graph-type analytics and am wondering what the best way would be to create the edge list data frame from a vertices data frame.

For example, below is my vertices data frame. I have a list of ids, and they belong to different groups.

+---+-----+
|id |group|
+---+-----+
|a  |1    |
|b  |2    |
|c  |1    |
|d  |2    |
|e  |3    |
|a  |3    |
|f  |1    |
+---+-----+
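
For reference, a minimal sketch that builds this sample vertices frame (assuming an active SparkSession named spark; column names taken from the table above):

df = spark.createDataFrame(
    [('a', 1), ('b', 2), ('c', 1), ('d', 2), ('e', 3), ('a', 3), ('f', 1)],
    ['id', 'group'],
)
df.show()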

My objective is to create an edge list data frame indicating which ids appear in common groups. Please note that one id can appear in multiple groups (e.g. id a above is in groups 1 and 3). Below is the edge list data frame that I'd like to get:

+---+-----+-----+
|src|dst  |group|
+---+-----+-----+
|a  |c    |1    |
|a  |f    |1    |
|c  |f    |1    |
|b  |d    |2    |
|a  |e    |3    |
+---+-----+-----+
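
In relational terms, the objective is every unordered pair of ids that share a group. As a sketch only, under the assumption that the frame is named df as above (the accepted answer below takes a window-function route instead), this could also be expressed as a self-join:

import pyspark.sql.functions as f

# Join the frame to itself on `group`, keeping each unordered pair once
# by requiring src < dst; this also drops self-pairs such as (a, a).
edges = (df.alias('a')
         .join(df.alias('b'), on='group')
         .where(f.col('a.id') < f.col('b.id'))
         .select(f.col('a.id').alias('src'),
                 f.col('b.id').alias('dst'),
                 'group'))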

Thanks in advance!

Answer

Edit 1

Not sure if this is the best way to solve it, but here is a workaround:

from pyspark.sql import Window
import pyspark.sql.functions as f

# Collect every id that appears in each group.
df = df.withColumn('match', f.collect_set('id').over(Window.partitionBy('group')))

# Pair each id with every other id in its group.
df = df.select(f.col('id').alias('src'),
               f.explode('match').alias('dst'),
               f.col('group'))

# Sort each pair so that (a, c) and (c, a) collapse onto the same key,
# then drop self-loops and duplicate edges.
df = df.withColumn('duplicate_edges', f.array_sort(f.array('src', 'dst')))
df = (df
      .where(f.col('src') != f.col('dst'))
      .drop_duplicates(subset=['duplicate_edges'])
      .drop('duplicate_edges'))

df.sort('group', 'src', 'dst').show()

Output

+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  e|  a|    3|
+---+---+-----+
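
Since the question is about graphframes, here is a minimal sketch of feeding this result into a graph (assuming the graphframes package is installed, and that the original id/group frame was kept under the hypothetical name vertices_df before being overwritten above):

from graphframes import GraphFrame

# GraphFrame expects a vertices frame with an `id` column and an edges
# frame with `src` and `dst` columns, which the result above provides.
# `vertices_df` is a hypothetical name for the original id/group frame.
g = GraphFrame(vertices_df.select('id').distinct(), df)
g.degrees.show()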

Original answer

Try this (note that it produces only one edge per group, which is why Edit 1 above was added):

import pyspark.sql.functions as f

# Pair the first and last id seen in each group,
# producing exactly one edge per group.
df = (df
      .groupby('group')
      .agg(f.first('id').alias('src'),
           f.last('id').alias('dst')))

df.show()

Output:

+-----+---+---+
|group|src|dst|
+-----+---+---+
|    1|  a|  c|
|    3|  e|  a|
|    2|  b|  d|
+-----+---+---+
