Spark iterative/recursive algorithms - Breaking spark lineage

Problem description
I have a recursive Spark algorithm that applies a sliding window of 10 days to a Dataset.
The original dataset is loaded from a Hive table partitioned by date.
At each iteration, a complex set of operations is applied to the Dataset containing the ten-day window.
The last date is then inserted back into the original Hive table, and the next date is loaded from Hive and unioned with the remaining nine days.
I realise that I need to break the Spark lineage to prevent the DAG from growing unmanageably large.
I believe I have two options:

- Checkpointing - involves a costly write to HDFS.
- Converting to an RDD and back again:

  spark.createDataset(myDS.rdd)
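As a sketch, the two options might look like the following (the `SparkSession`, the `Reading` case class, and the checkpoint path are illustrative assumptions, not from the original question):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type standing in for the real schema.
case class Reading(date: String, value: Double)

val spark = SparkSession.builder().appName("lineage-demo").getOrCreate()
import spark.implicits._

// Option 1: reliable checkpointing. Writes the Dataset's contents to the
// checkpoint directory on HDFS and returns a Dataset whose plan starts there.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative path
def viaCheckpoint(ds: Dataset[Reading]): Dataset[Reading] =
  ds.checkpoint() // eager by default; blocks until the data is written

// Option 2: round-trip through an RDD. The new Dataset's logical plan begins
// at the RDD scan, so the accumulated query plan is dropped. Nothing is
// persisted, though: without a cache the upstream computation reruns whenever
// the RDD is rescanned.
def viaRdd(ds: Dataset[Reading]): Dataset[Reading] =
  spark.createDataset(ds.rdd)
```

Note that option 2 truncates the logical plan but does not by itself materialize anything; combining it with `cache()` avoids recomputation.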
Are there any disadvantages to using the second option? I am assuming it is an in-memory operation and therefore cheaper.
Recommended answer
Checkpointing and converting back to RDD are indeed the best/only ways to truncate lineage.
Many (all?) of the Spark ML Dataset/DataFrame algorithms are actually implemented using RDDs: the exposed APIs are DS/DF, but internally they drop down to RDDs, precisely because the optimizer is not parallelized and because of the lineage size that iterative/recursive implementations produce.
There is a cost to converting to an RDD and back, but it is smaller than the file-system checkpointing option.
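Applied to the question's sliding-window loop, the advice might look like this sketch; every helper (`loadTenDayWindow`, `complexOperations`, `writeLastDateToHive`, `dropOldestDay`, `loadDay`) is a hypothetical placeholder for the questioner's own logic:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical iterative loop that truncates lineage once per iteration so the
// DAG stays bounded instead of growing with every pass.
var window: Dataset[Reading] = loadTenDayWindow()          // initial 10-day window
for (day <- remainingDates) {
  val result = complexOperations(window)                   // the expensive step
  writeLastDateToHive(result)                              // insert last date back into Hive
  val next = dropOldestDay(result).union(loadDay(day))     // slide the window forward
  // Cache before the round-trip so `next` is not recomputed when the RDD is
  // rescanned, then rebuild the Dataset from the RDD to drop the old plan.
  window = spark.createDataset(next.cache().rdd)
}
```

Unpersisting the previous iteration's cached data (via `unpersist()`) is also worth considering in a long-running loop, so executor memory is not exhausted.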