How to create a graph from Array[(Any, Any)] using Graph.fromEdgeTuples
Problem description
I am very new to Spark, but I want to create a graph from relations that I get from a Hive table. I found a function that is supposed to allow this without defining the vertices, but I can't get it to work.
I know this isn't a reproducible example, but here is my code:
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val sqlContext= new org.apache.spark.sql.hive.HiveContext(sc)
val data = sqlContext.sql("select year, trade_flow, reporter_iso, partner_iso, sum(trade_value_us) from comtrade.annual_hs where length(commodity_code)='2' and not partner_iso='WLD' group by year, trade_flow, reporter_iso, partner_iso").collect()
val data_2010 = data.filter(line => line(0)==2010)
val couples = data_2010.map(line => (line(2), line(3))) // country to country
val graph = Graph.fromEdgeTuples(couples, 1)
The last line generates the following error:
val graph = Graph.fromEdgeTuples(sc.parallelize(couples), 1)
<console>:31: error: type mismatch;
found : Array[(Any, Any)]
required: Seq[(org.apache.spark.graphx.VertexId,org.apache.spark.graphx.VertexId)]
Error occurred in an application involving default arguments.
val graph = Graph.fromEdgeTuples(sc.parallelize(couples), 1)
couples looks like this:
couples: Array[(Any, Any)] = Array((MWI,MOZ), (WSM,AUS), (MDA,CRI), (KNA,HTI), (PER,ERI), (SWE,CUB), (DEU,PRK), (THA,DJI), (BIH,SVK), (RUS,THA), (SGP,BLR), (MEX,TGO), (TUR,ZAF), (ZWE,SYC), (UGA,GHA), (OMN,SVN), (NZL,SYR), (CHE,SLV), (CZE,LUX), (TGO,COM), (TTO,WLF), (NGA,PAN), (FJI,UKR), (BRA,ECU), (EGY,SWE), (ITA,ARG), (MUS,MLT), (MDG,DZA), (ARE,SUR), (CAN,GUY), (OMN,COG), (NAM,FIN), (ITA,HMD), (SWE,CHE), (SDN,NER), (TUN,USA), (THA,GMB), (HUN,TTO), (FRA,BEN), (NER,TCD), (CHN,JPN), (DNK,ZAF), (MLT,UKR), (ARM,OMN), (PRT,IDN), (BEN,PER), (TTO,BRA), (KAZ,SMR), (CPV,""), (ARG,ZAF), (BLR,TJK), (AZE,SVK), (ITA,STP), (MDA,IRL), (POL,SVN), (PRY,ETH), (HKG,MOZ), (QAT,GAB), (THA,MUS), (PHL,MOZ), (ITA,SGS), (ARM,KHM), (ARG,KOR), (AUT,GMB), (SYR,COM), (CZE,GBR), (DOM,USA), (CYP,LAO), (USA,LBR)
How can I convert it to a suitable format?
Recommended answer
First of all, you cannot use String as a VertexId, so you have to map labels to Long. First we need to prepare a mapping from label to id. As long as the number of unique values is relatively small, the simplest approach is to create a broadcast variable:
val idMap = sc.broadcast(couples // -> Array[(Any, Any)]
// Make sure we use String not Any returned from Row.apply
// And convert to Seq so we can flatten results
.flatMap{case (x: String, y: String) => Seq(x, y)} // -> Array[String]
// Get different keys
.distinct // -> Array[String]
// Create (key, value) pairs
.zipWithIndex // -> Array[(String, Int)]
// Convert values to Long so we can use it as a VertexId
.map{case (k, v) => (k, v.toLong)} // -> Array[(String, Long)]
// Create map
.toMap) // -> Map[String,Long]
Next we can use the above to perform the mapping:
val edges: RDD[(VertexId, VertexId)] = sc.parallelize(couples
.map{case (x: String, y: String) => (idMap.value(x), idMap.value(y))}
)
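The two steps above can be tried without a cluster on plain Scala collections. The following is a minimal sketch, not part of the original answer: the object name LabelIds and the sample data are hypothetical, and it uses collect instead of a bare pattern match so that pairs containing a non-String value (for example null) are dropped rather than raising a MatchError.

```scala
object LabelIds {
  // Hypothetical sample standing in for the `couples` array from the question.
  val couples: Array[(Any, Any)] =
    Array(("MWI", "MOZ"), ("WSM", "AUS"), ("MOZ", "AUS"), ("CPV", null))

  // Keep only pairs where both ends are Strings; a type pattern never matches null.
  val labelled: Array[(String, String)] = couples.collect {
    case (x: String, y: String) => (x, y)
  }

  // Give every distinct label a Long id, in order of first appearance.
  val idMap: Map[String, Long] = labelled
    .flatMap { case (x, y) => Seq(x, y) }
    .distinct
    .zipWithIndex
    .map { case (k, v) => (k, v.toLong) }
    .toMap

  // Translate each labelled pair into a pair of ids usable as VertexId values.
  val edges: Array[(Long, Long)] =
    labelled.map { case (x, y) => (idMap(x), idMap(y)) }

  def main(args: Array[String]): Unit = {
    println(idMap)
    edges.foreach(println)
  }
}
```

Whether silently dropping malformed pairs is acceptable depends on the data; counting the dropped rows first may be a safer choice.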
Finally, create the graph:
val graph = Graph.fromEdgeTuples(edges, 1)
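After this step the vertices carry only Long ids and the default attribute 1, so the country codes are no longer visible in the graph. One possible way to put them back, sketched here as an assumption rather than as part of the original answer, is to invert the map and apply it with mapVertices. The object name LabelledGraphSketch, the method build, and the sample data are hypothetical, and the local SparkContext exists only to make the sketch standalone; in spark-shell the existing sc would be used instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object LabelledGraphSketch {
  // Hypothetical sample standing in for the cleaned `couples` data.
  val couples: Array[(String, String)] =
    Array(("MWI", "MOZ"), ("WSM", "AUS"), ("MOZ", "AUS"))

  def build(sc: SparkContext): Graph[String, Int] = {
    // Label -> id, built the same way as idMap in the answer.
    val idMap: Map[String, Long] = couples
      .flatMap { case (x, y) => Seq(x, y) }
      .distinct
      .zipWithIndex
      .map { case (k, v) => (k, v.toLong) }
      .toMap

    val edges: RDD[(VertexId, VertexId)] =
      sc.parallelize(couples.map { case (x, y) => (idMap(x), idMap(y)) }.toSeq)
    val graph = Graph.fromEdgeTuples(edges, 1)

    // Invert the mapping (id -> label) and broadcast it to the executors.
    val labelById = sc.broadcast(idMap.map(_.swap))

    // Replace the default vertex attribute with the original label.
    graph.mapVertices((id, _) => labelById.value(id))
  }

  def main(args: Array[String]): Unit = {
    // Local context for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("labelled-graph-sketch").setMaster("local[*]"))
    try {
      build(sc).triplets.collect()
        .foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}"))
    } finally {
      sc.stop()
    }
  }
}
```

Like the broadcast approach in the answer, this assumes the number of distinct labels is small enough to fit comfortably in memory on every executor.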