Generate `VertexId` from pairs of `String`


Problem description

I'm using GraphX to process some graph data on Spark. The input data is given as an RDD[(String, String)]. I used the following snippet to map each String to a VertexId and build the graph.

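import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val input: RDD[(String, String)] = ...

// Assign a Long id to every distinct string seen on either side of a pair.
val vertexIds = input.map(_._1)
                     .union(input.map(_._2))
                     .distinct()
                     .zipWithUniqueId()
                     .cache()

// Resolve the source string to its id, re-key by the destination string,
// then resolve the destination string to its id.
val edges = input.join(vertexIds)
                 .map { case (u, (v, uid)) => (v, uid) }
                 .join(vertexIds)
                 .map { case (v, (uid, vid)) => Edge(uid, vid, 1) }

val graph = Graph(vertexIds.map { case (v, vid) => (vid, v) }, edges)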

When I did a spot check of the top 1000 highest-degree nodes, I found that the GraphX result differs from the original input. Here's how I dump the high-degree nodes:

graph.outerJoinVertices(graph.outDegrees) {
  (_, vdata, deg) => (deg.getOrElse(0L), vdata)
}.vertices.map(_._2).top(1000).saveTo(....)

I suspect .zipWithUniqueId gives unstable ids on each evaluation. I tried:

  • inserting vertexIds.count() to force materialization, so that vertexIds doesn't get reevaluated.
  • inserting .sortBy(...).zipWithUniqueId() to make sure the ordering is the same.

Neither of them solves the problem. The top 1000 highest-degree nodes still differ slightly from run to run.

Recommended answer

I found two solutions to stabilize the String -> VertexId mapping:

  • Persist vertexIds to the file system (FS).

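// Compute the mapping once and write it out, so the ids are fixed on disk.
input.map(_._1)
     .union(input.map(_._2))
     .distinct()
     .zipWithUniqueId()
     .saveAsObjectFile("some location")
// Read the saved mapping back; objectFile needs the element type spelled out.
val vertexIds = sc.objectFile[(String, Long)]("some location")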

  • Use a collision-resistant hash function. I used Guava's murmur3_128 hash and took the first 8 bytes as the vertexId. With this approach you don't need any further joins, which is more efficient.
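The hashing approach could look roughly like the sketch below. It is not the answerer's actual code: the helper names toVertexId and buildGraph are hypothetical, and it assumes Guava is on the classpath and that strings are encoded as UTF-8 before hashing. Guava's HashCode.asLong() returns the first 8 bytes of the hash as a little-endian long, which matches the "first 8 bytes" described above.

```scala
import java.nio.charset.StandardCharsets

import com.google.common.hash.Hashing
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper: deterministic String -> VertexId mapping.
// The same string always hashes to the same id, on every run and every executor.
def toVertexId(s: String): VertexId =
  Hashing.murmur3_128().hashString(s, StandardCharsets.UTF_8).asLong()

// Hypothetical helper: build the graph without joining against an id table.
def buildGraph(input: RDD[(String, String)]): Graph[String, Int] = {
  // One (id, label) pair per distinct string on either side of a pair.
  val vertices = input.flatMap { case (u, v) => Seq(u, v) }
                      .distinct()
                      .map(s => (toVertexId(s), s))
  // Each edge computes its endpoint ids locally, so no join is needed.
  val edges = input.map { case (u, v) => Edge(toVertexId(u), toVertexId(v), 1) }
  Graph(vertices, edges)
}
```

Because only 64 of the 128 hash bits are kept, two different strings could in principle map to the same id. For typical graph sizes this is very unlikely, but if it matters, comparing the number of distinct strings with the number of distinct ids would detect it.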
