有效计算pyspark中的连接组件 [英] efficiently calculating connected components in pyspark

查看:43
本文介绍了有效计算pyspark中的连接组件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试为城市中的朋友寻找连接组件.我的数据是具有城市属性的边列表.

I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.

城市 |资源中心 |目的地

City | SRC | DEST

休斯顿凯尔 -> 本尼

Houston Kyle -> Benny

休斯顿本尼 -> 查尔斯

Houston Benny -> Charles

休斯顿查尔斯 -> 丹尼

Houston Charles -> Denny

奥马哈卡罗尔 -> 布莱恩

Omaha Carol -> Brian

等等.

我知道 pyspark 的 GraphX 库的 connectedComponents 函数将遍历图的所有边以找到连接的组件,我想避免这种情况.我该怎么做?

I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so?

我以为我可以做类似的事情

edit: I thought I could do something like

从数据框中选择 connected_components(*)按城市分组

select connected_components(*) from dataframe groupby city

connected_components 生成项目列表.

where connected_components generates a list of items.

推荐答案

假设你的数据是这样的

import org.apache.spark._
import org.graphframes._

val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
            ("Houston","Charles","Denny"),("Omaha","carol","Brian"),
            ("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
var df = spark.createDataFrame(l).toDF("city","src","dst")

创建要运行连接组件的城市列表cities = List("Houston","Omaha")

Create a list of cities for which you want to run connected components cities = List("Houston","Omaha")

现在对城市列表中的每个城市的城市列运行过滤器,然后从结果数据框中创建边和顶点数据框.从这些边和顶点数据框创建图形框并运行连通分量算法

Now run a filter on city column for every city in cities list, then create an edge and vertex dataframes from the resulting dataframe. Create a graphframe from these edge and vertices dataframes and run connected components algorithm

val cities = List("Houston","Omaha")

for(city <- cities){
    val edges = df.filter(df("city") === city).drop("city")
    val vert = edges.select("src").union(edges.select("dst")).
                     distinct.select(col("src").alias("id"))
    val g = GraphFrame(vert,edges)
    val res = g.connectedComponents.run()
    res.select("id", "component").orderBy("component").show()
}

输出

|     id|   component|
+-------+------------+
|   Kyle|249108103168|
|charles|249108103168|
|  Benny|249108103168|
|Charles|721554505728|
|  Denny|721554505728|
+-------+------------+

+------+------------+                                                           
|    id|   component|
+------+------------+
| Marry|858993459200|
|  Sara|858993459200|
| Brian|944892805120|
| carol|944892805120|
|Daniel|944892805120|
+------+------------+

这篇关于有效计算pyspark中的连接组件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆