Join files in Apache Spark

This article covers how to join files in Apache Spark; the recommended answer below should be a useful reference for anyone facing the same problem.

Problem description

I have a file like this, code_count.csv:

code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004

Another file like this, details.csv:

code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu

I want the total sum for each code, but in the final output I want the exp_code, like this:

Aerogon international,5
Bloomberg Xtern,4
Classic Divide,4

This is my code:

var countData = sc.textFile("C:\\path\\to\\code_count.csv")
// skip the header row, then pair each record as (code, count)
var countDataKV = countData.filter(!_.startsWith("code")).map(x => x.split(",")).map(x => (x(0), x(1).toInt))
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)

which gives:

Array[(String, Int)] = Array((AE,5), (BX,9))

Here sum is RDD[(String, Int)]. I am confused about how to pull the exp_code from the other file. Please guide.

Recommended answer

You need to calculate the sum after a groupBy on code, and then join the other dataframe. Below is a similar example.

import spark.implicits._
import org.apache.spark.sql.functions._

val df1 = spark.sparkContext.parallelize(Seq(
    ("AE", 2, 2008), ("AE", 3, 2008), ("BX", 1, 2005),
    ("CD", 4, 2004), ("HU", 1, 2003), ("BX", 8, 2004)))
  .toDF("code", "count", "year")

val df2 = spark.sparkContext.parallelize(Seq(
    ("AE", "Aerogon international"), ("BX", "Bloomberg Xtern"),
    ("CD", "Classic Divide"), ("HU", "Honololu")))
  .toDF("code", "exp_code")

// total count per code
val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))

// attach exp_code via the code column, then drop the join key
val finalDF = sumdf1.join(df2, "code").drop("code")

finalDF.show()
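
Since the question actually reads the data from files, you could also build both dataframes straight from the CSVs instead of parallelize. A sketch, assuming a Spark 2.x session and the same placeholder paths as in the question:

// Load the CSVs directly: "header" uses the first row as column
// names, "inferSchema" parses count/year as integers.
val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("C:\\path\\to\\code_count.csv")

val df2 = spark.read
  .option("header", "true")
  .csv("C:\\path\\to\\details.csv")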

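If you would rather stay with the RDD approach from the question, a plain pair-RDD join gives the same result. A minimal sketch, assuming the header rows are filtered out as in the question code above, where sum is the RDD[(String, Int)] computed there:

// key details.csv by code: (code, exp_code)
val details = sc.textFile("C:\\path\\to\\details.csv")
  .filter(!_.startsWith("code"))
  .map(_.split(","))
  .map(x => (x(0), x(1)))

// join on code, then keep only (exp_code, total)
val result = sum.join(details)
  .map { case (_, (total, expCode)) => (expCode, total) }

result.collect().foreach(println)   // e.g. (Aerogon international,5)
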