在Spark GraphX中使用广播变量和RDD.filter比较两个节点之间的交集 [英] Comparing intersection between two nodes using broadcast variable and using RDD.filter in Spark GraphX

查看:115
本文介绍了在Spark GraphX中使用广播变量和RDD.filter比较两个节点之间的交集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在GraphX中处理图形.通过使用下面的代码,我创建了一个变量来存储RDD中节点的邻居:

  val all_neighbors:VertexRDD [Array [VertexId]] = graph.collectNeighborIds(EdgeDirection.Either) 

我使用广播变量通过以下代码向所有从站广播邻居:

  val broadcastVar = all_neighbors.collect().toMapval nvalues = sc.broadcast(broadcastVar) 

我想计算两个节点邻居之间的交集.例如节点1和节点2邻居之间的交集.

首先,我将以下代码用于计算使用广播变量nvalues的交集:

  val common_neighbors = nvalues.value(1).intersect(nvalues.value(2)) 

一旦我使用下面的代码来计算两个节点的交点:

  val common_neighbors2 =(all_neighbors.filter(x => x._1 == 1)).intersection(all_neighbors.filter(x => x._1 == 2)) 

我的问题是:以上哪种方法有效,分布更广泛,更并行?使用广播变量 nvalue 计算交集还是使用过滤RDD 方法?

解决方案

我认为这取决于情况.

如果您的 nvalues 大小较小并且可以适合每个执行器和驱动程序节点,则广播方法将是最佳方法,因为数据被缓存在执行器中,并且不会重新计算该数据,并且再次.此外,它将节省大量通信和计算负担.在这种情况下,另一种方法不是最佳方法,因为可能每次都会计算 all_neighbours rdd,这会降低性能,因为会产生大量的重新计算,并且会增加计算成本.

如果您的 nvalues 无法放入每个执行程序和驱动程序节点,广播将无法工作,因为它将引发错误.因此,别无选择,只能使用第二种方法,尽管它仍然可能会导致性能问题,至少代码可以工作!!

让我知道是否有帮助!

i work on graphs in GraphX. by using the below code i have made a variable to store neighbors of nodes in RDD:

val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either)

i used broadcast variable to broadcast neighbors to all slaves by using below code:

val broadcastVar = all_neighbors.collect().toMap
val nvalues = sc.broadcast(broadcastVar)

i want to compute intersection between two nodes neighbors. for example intersection between node 1 and node 2 neighbors.

At first i use this code for computing intersection that uses the broadcast variable nvalues:

val common_neighbors=nvalues.value(1).intersect(nvalues.value(2))

and once i used the below code for computing intersection of two nodes:

val common_neighbors2=(all_neighbors.filter(x=>x._1==1)).intersection(all_neighbors.filter(x=>x._1==2))

my question is this: which one of the above methods is efficient and more distributed and parallel? using the broadcast variable nvalue for computing intersection or using filtering RDD method?

解决方案

I think it depends on the situation.

In the case where your nvalues size is less and can fit into each executor and driver node, the approach with broadcasting will be optimal as data is cached in executors and this data is not recomputed over and over again. Also, it will save spark a huge communication and compute burden. In such cases, the other approach is not optimal as it might happen that all_neighbours rdd is calculated every time and this will decrease the performance as there will be a lot of recomputations and will increase computation cost.

In the case where your nvalues cannot fit into each executor and driver node, broadcasting will not work as it will throw an error. Hence, there is no option left but to use the second approach though it might still cause performance issues at least code will work!!

Let me know if it helps!!

这篇关于在Spark GraphX中使用广播变量和RDD.filter比较两个节点之间的交集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆