Getting "Joining two VertexPartitions with different indexes is slow" in Spark GraphX after unpersisting a graph

Views: 307 | Published: 2020/9/4 7:32:34 | Tags: scala, apache-spark

Problem description

Sorry about the long and inaccurate title. If you understand what I'm asking, please help me edit it. Thanks.

The code is below. If you run it, you get the following warning:

14/06/12 14:33:24 WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow.

But if you comment out graph.unpersistVertices(blocking = false), the warning no longer appears. So I'm curious why this call changes the index of the Graph object.

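import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object Test {
  def main(args: Array[String]): Unit = {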
    val conf = new SparkConf().setAppName("Test")
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val v: RDD[(VertexId, Int)] = sc.parallelize(Seq((0L,0),(1L,1),(2L,2)))
    val e: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(0, 1, 0), Edge(0, 2, 0), Edge(1, 0, 0), Edge(2, 1, 0)))


    val g = Graph(v, e)

    def test(graph: Graph[Int, Int]) = {
      graph.cache()
      val ng = graph.outerJoinVertices(graph.outDegrees){
        // out is an Option[Int]; vertices with no outgoing edges get degree 0
        (vid, vd, out) => (vd, out.getOrElse(0))
      }

      val f = ng.subgraph(epred = _.srcId != 0, vpred = (vid, vd) => vid != 0L)
      f.cache()
      graph.unpersistVertices(blocking = false)
      f
    }

    val f1 = test(g)

    println(f1.numVertices)

  }
}

As far as I know, when you apply an operation such as mapValues to a GraphX Graph, the index of the underlying VertexRDD is reused to avoid recomputation. When you do something like subgraph, the index is still reused by applying a bitmask to it. Isn't outerJoinVertices the same kind of operation, since it only modifies the values of the RDD?

Moreover, I cache() the new graph before unpersisting the old one, so I expected the unpersist not to affect the new graph, since it was already cached. But I was wrong.

How do cache and unpersist work? Why would they affect the indexes when I'm not actually joining partitions myself?
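One experiment that may help answer this is to force the new graph to be computed before the old one is unpersisted. The sketch below is only a hypothetical variant of test() from the question, not a confirmed fix: it assumes that cache() merely marks a graph for caching and that the data is stored only when an action runs. The object name MaterializeBeforeUnpersist and the two count() calls are additions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object MaterializeBeforeUnpersist {
  // Hypothetical variant of test(): same pipeline as in the question,
  // but f is computed while the cached parent graph is still available.
  def test(graph: Graph[Int, Int]): Graph[(Int, Int), Int] = {
    graph.cache()
    val ng = graph.outerJoinVertices(graph.outDegrees) {
      (vid, vd, out) => (vd, out.getOrElse(0))
    }
    val f = ng.subgraph(epred = _.srcId != 0L, vpred = (vid, vd) => vid != 0L)
    f.cache()
    // Assumption: running actions here stores f's vertices and edges
    // in the cache before the parent graph's vertices are released.
    f.vertices.count()
    f.edges.count()
    graph.unpersistVertices(blocking = false)
    f
  }

  def run(): Long = {
    val conf = new SparkConf().setAppName("MaterializeBeforeUnpersist").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val v: RDD[(VertexId, Int)] = sc.parallelize(Seq((0L, 0), (1L, 1), (2L, 2)))
      val e: RDD[Edge[Int]] = sc.parallelize(
        Seq(Edge(0L, 1L, 0), Edge(0L, 2L, 0), Edge(1L, 0L, 0), Edge(2L, 1L, 0)))
      test(Graph(v, e)).numVertices
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the driver log of this variant with the original ordering should show whether the warning depends on when the parent graph is unpersisted relative to the first action on f.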

Update: I looked into the code, and numVertices is actually a map followed by a reduce: partitionsRDD.map(_.size).reduce(_ + _). So the join actually happens at this line.
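The observation that the join runs at the numVertices line is consistent with Spark's lazy evaluation: transformations such as map only describe work, and nothing executes until an action such as reduce is called. The following is a minimal sketch of that behavior on a plain RDD, assuming Spark 2.x or later for sc.longAccumulator. The object name LazyEvaluationSketch and the accumulator name "touched" are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns the accumulator value before and after the action runs.
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val touched = sc.longAccumulator("touched")
      // map is a transformation: it only records what to do.
      val mapped = sc.parallelize(Seq(1, 2, 3)).map { x =>
        touched.add(1)
        x
      }
      val before: Long = touched.value // no element has been processed yet
      mapped.reduce(_ + _)             // reduce is an action and triggers the map
      val after: Long = touched.value  // every element has now been processed
      (before, after)
    } finally {
      sc.stop()
    }
  }
}
```

In the question's code, outerJoinVertices, subgraph, and cache() are all lazy in the same way, so the first action, println(f1.numVertices), is where the whole pipeline runs.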
