腹胀火花和放一个数据集;斯卡拉 [英] Bloating a dataset with spark & scala
本文介绍了腹胀火花和放一个数据集;斯卡拉的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
下面是我的要求
输入
customer_id status start_date end_date
1 Y 20140101 20140105
2 Y 20140201 20140203
输出
customer_id status date
1 Y 20140101
1 Y 20140102
1 Y 20140103
1 Y 20140104
1 Y 20140105
2 Y 20140201
2 Y 20140202
2 Y 20140202
我想在火花笛卡尔积来实现这一点,它看起来非常低效的。我的数据过于庞大。我在寻找一个更好的选择。
I'm trying to achieve this with cartesian product in spark and it looks to very inefficient. My dataset is too huge. I'm looking for a better option.
推荐答案
如果我正确地得到了你的想法,你可以这样来做:
If I got your idea correctly you can do it this way:
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val sc = new SparkContext(conf)
case class Input(customerId: Long, status: String, startDate: LocalDate, endDate: LocalDate)
case class Output(customerId: Long, status: String, date: LocalDate)
val input: RDD[Input] = sc.parallelize(Seq(
Input(1, "Y", LocalDate.of(2014, 1, 1), LocalDate.of(2014, 1, 5)),
Input(2, "Y", LocalDate.of(2014, 1, 1), LocalDate.of(2014, 1, 3))
))
val result: RDD[Output] = input flatMap { input =>
import input._
val dates = Stream.iterate(startDate)(_.plusDays(1)).takeWhile(!_.isAfter(endDate))
dates.map(date => Output(customerId, status, date))
}
result.collect().foreach(println)
输出:
Output(1,Y,2014-01-01)
Output(1,Y,2014-01-02)
Output(1,Y,2014-01-03)
Output(1,Y,2014-01-04)
Output(1,Y,2014-01-05)
Output(2,Y,2014-01-01)
Output(2,Y,2014-01-02)
Output(2,Y,2014-01-03)
这篇关于腹胀火花和放一个数据集;斯卡拉的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文