Join files in Apache Spark
Question
I have a file like this (code_count.csv):
code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004
And another file like this (details.csv):
code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu
I want the total sum for each code, but in the final output I want the exp_code, like this:
Aerogon international,5
Bloomberg Xtern,4
Classic Divide,4
Here is my code:
var countData = sc.textFile("C:\\path\\to\\code_count.csv")
val header = countData.first()   // skip the "code,count,year" header line
var countDataKV = countData.filter(_ != header).map(_.split(",")).map(x => (x(0), x(1).toInt))
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)
which gives:
Array[(String, Int)] = Array((AE,5), (BX,9))
Here sum is an RDD[(String, Int)]. I am confused about how to pull in the exp_code from the other file. Please guide.
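For intuition, the aggregate-then-join flow can be sketched with plain Scala collections (no Spark needed); the names countRows and details are illustrative, not from the original post:

```scala
// Rows from code_count.csv as (code, count) pairs
val countRows = Seq(("AE", 2), ("AE", 3), ("BX", 1), ("CD", 4), ("HU", 1), ("BX", 8))

// Lookup table from details.csv: code -> exp_code
val details = Map(
  "AE" -> "Aerogon international",
  "BX" -> "Bloomberg Xtern",
  "CD" -> "Classic Divide",
  "HU" -> "Honololu"
)

// Step 1: sum counts per code (this is what foldByKey does)
val sums = countRows.groupBy(_._1).map { case (code, rows) => code -> rows.map(_._2).sum }

// Step 2: replace each code with its exp_code (this is what a join on "code" does)
val result = sums.map { case (code, total) => details(code) -> total }
```

In Spark the same two steps map onto `foldByKey` (or `reduceByKey`) followed by an RDD `join` against the details pairs, or, as the answer below shows, a DataFrame `groupBy`/`agg` followed by a DataFrame `join`.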
Answer
You need to calculate the sum after grouping by code, and then join with the other dataframe. Below is a similar example.
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df1 = spark.sparkContext.parallelize(Seq(
  ("AE", 2, 2008), ("AE", 3, 2008), ("BX", 1, 2005),
  ("CD", 4, 2004), ("HU", 1, 2003), ("BX", 8, 2004)
)).toDF("code", "count", "year")

val df2 = spark.sparkContext.parallelize(Seq(
  ("AE", "Aerogon international"), ("BX", "Bloomberg Xtern"),
  ("CD", "Classic Divide"), ("HU", "Honololu")
)).toDF("code", "exp_code")

// Sum counts per code, then join on "code" and drop the join key
val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))
val finalDF = sumdf1.join(df2, "code").drop("code")
finalDF.show()