Spark iterative/recursive algorithms - Breaking spark lineage

Problem description
I have a recursive Spark algorithm that applies a sliding window of 10 days to a Dataset.
The original dataset is loaded from a Hive table partitioned by date.
At each iteration, a complex set of operations is applied to the Dataset containing the ten-day window.
The last date is then inserted back into the original Hive table, and the next date is loaded from Hive and unioned with the remaining nine days.
I realise that I need to break the Spark lineage to prevent the DAG from growing unmanageably large.
I believe I have two options:

- Checkpointing - involves a costly write to HDFS.
- Converting to an RDD and back again:

  spark.createDataset(myDS.rdd)
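As a sketch, the two options might look like the following (the `SparkSession`, the `Reading` case class, and the checkpoint path are illustrative assumptions, not from the original question):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type standing in for the real schema.
case class Reading(date: String, value: Double)

val spark = SparkSession.builder().appName("lineage-demo").getOrCreate()
import spark.implicits._

// Option 1: reliable checkpointing. Writes the Dataset's contents to the
// checkpoint directory on HDFS and returns a Dataset whose plan starts there.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative path
def viaCheckpoint(ds: Dataset[Reading]): Dataset[Reading] =
  ds.checkpoint() // eager by default; blocks until the data is written

// Option 2: round-trip through an RDD. The new Dataset's logical plan begins
// at the RDD scan, so the accumulated query plan is dropped. Nothing is
// persisted, though: without a cache the upstream computation reruns whenever
// the RDD is rescanned.
def viaRdd(ds: Dataset[Reading]): Dataset[Reading] =
  spark.createDataset(ds.rdd)
```

Note that option 2 truncates the logical plan but does not by itself materialize anything; combining it with `cache()` avoids recomputation.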
Are there any disadvantages to using the second option? I am assuming it is an in-memory operation and therefore cheaper.
Recommended answer
Checkpointing and converting back to RDD are indeed the best/only ways to truncate lineage.
Many (all?) of the Spark ML Dataset/DataFrame algorithms are actually implemented using RDDs: the exposed APIs are DS/DF, but internally they drop down to RDDs, precisely because the optimizer is not parallelized and because of the lineage size that iterative/recursive implementations produce.
There is a cost to converting to an RDD and back, but it is smaller than the file-system checkpointing option.
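Applied to the question's sliding-window loop, the advice might look like this sketch; every helper (`loadTenDayWindow`, `complexOperations`, `writeLastDateToHive`, `dropOldestDay`, `loadDay`) is a hypothetical placeholder for the questioner's own logic:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical iterative loop that truncates lineage once per iteration so the
// DAG stays bounded instead of growing with every pass.
var window: Dataset[Reading] = loadTenDayWindow()          // initial 10-day window
for (day <- remainingDates) {
  val result = complexOperations(window)                   // the expensive step
  writeLastDateToHive(result)                              // insert last date back into Hive
  val next = dropOldestDay(result).union(loadDay(day))     // slide the window forward
  // Cache before the round-trip so `next` is not recomputed when the RDD is
  // rescanned, then rebuild the Dataset from the RDD to drop the old plan.
  window = spark.createDataset(next.cache().rdd)
}
```

Unpersisting the previous iteration's cached data (via `unpersist()`) is also worth considering in a long-running loop, so executor memory is not exhausted.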