How to create a VertexId in Apache Spark GraphX using a Long data type?
Problem description
I'm trying to create a Graph from some Google Web Graph data, which can be found here:
https://snap.stanford.edu/data/web-Google.html
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val textFile = sc.textFile("hdfs://n018-data.hursley.ibm.com/user/romeo/web-Google.txt")
val arrayForm = textFile.filter(_.charAt(0)!='#').map(_.split("\\s+")).cache()
val nodes = arrayForm.flatMap(array => array).distinct().map(_.toLong)
val edges = arrayForm.map(line => Edge(line(0).toLong,line(1).toLong))
val graph = Graph(nodes,edges)
Unfortunately, I get this error:
<console>:27: error: type mismatch;
found : org.apache.spark.rdd.RDD[Long]
required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, ?)]
Error occurred in an application involving default arguments.
val graph = Graph(nodes,edges)
So how can I create a VertexId object? As I understand it, passing a Long should be sufficient.
Any ideas?
Thanks a lot!
romeo
Recommended answer
Not exactly. If you take a look at the signature of the apply method of the Graph object, you'll see something like this (for the full signature, see the API docs):
apply[VD, ED](
vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], defaultVertexAttr: VD)
The accompanying description reads:
Construct a graph from a collection of vertices and edges with attributes.
Because of that, you cannot simply pass RDD[Long] as the vertices argument (RDD[Edge[Nothing]] as edges won't work either). Each vertex id has to be paired with an attribute, and the edges need an attribute type too, for example:
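// Pair every vertex id with an attribute; here an empty Option[String].
val nodes: RDD[(VertexId, Option[String])] = arrayForm.
  flatMap(array => array).
  map(id => (id.toLong, Option.empty[String]))
// Edges carry an attribute as well; an empty String serves as a placeholder.
val edges: RDD[Edge[String]] = arrayForm.
  map(line => Edge(line(0).toLong, line(1).toLong, ""))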
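The snippet above stops just before the final Graph(...) call. Here is a minimal sketch of the whole path, assuming a spark-shell session (where sc is predefined). The names sampleLines, parsed, sampleNodes, sampleEdges and sampleGraph, and the three sample edges, are made up for illustration and stand in for web-Google.txt:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the real file: one comment line plus three edges.
val sampleLines = sc.parallelize(Seq("# FromNodeId\tToNodeId", "0\t1", "0\t2", "1\t2"))
val parsed = sampleLines.filter(!_.startsWith("#")).map(_.split("\\s+"))

// VertexId is a type alias for Long, so a plain Long is a valid id.
// What Graph.apply needs in addition is an attribute paired with each id.
val sampleNodes: RDD[(VertexId, Option[String])] =
  parsed.flatMap(fields => fields).map(id => (id.toLong, Option.empty[String]))
val sampleEdges: RDD[Edge[String]] =
  parsed.map(fields => Edge(fields(0).toLong, fields(1).toLong, ""))

// Duplicate vertex ids in sampleNodes are merged when the graph is built.
val sampleGraph: Graph[Option[String], String] = Graph(sampleNodes, sampleEdges)
println(s"vertices: ${sampleGraph.vertices.count()}, edges: ${sampleGraph.edges.count()}")
```

So the Long itself was never the problem. The original call failed because the vertex RDD held bare ids instead of (id, attribute) pairs.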
Please note that:
Duplicate vertices are picked arbitrarily
so calling .distinct() on nodes is unnecessary.
If you want to create a Graph without attributes, you can use Graph.fromEdgeTuples.
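A minimal sketch of that alternative, again assuming a spark-shell session with sc available. The names rawEdges and tupleGraph and the sample edges are hypothetical:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical sample edges standing in for the parsed web-Google.txt lines.
val rawEdges: RDD[(VertexId, VertexId)] =
  sc.parallelize(Seq((0L, 1L), (0L, 2L), (1L, 2L)))

// defaultValue is assigned to every vertex; each edge gets the Int attribute 1.
val tupleGraph: Graph[Int, Int] = Graph.fromEdgeTuples(rawEdges, defaultValue = 0)
println(s"vertices: ${tupleGraph.vertices.count()}, edges: ${tupleGraph.edges.count()}")
```

The resulting graph still has attribute types (Int for both vertices and edges here), but you do not have to build the vertex RDD yourself.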