Spark iterative/recursive algorithms - Breaking spark lineage


Question

I have a recursive Spark algorithm that applies a sliding window of 10 days to a Dataset.

The original dataset is loaded from a Hive table partitioned by date.

At each iteration, a complex set of operations is applied to the Dataset containing the ten-day window.

The last date is then inserted back into the original Hive table, and the next date is loaded from Hive and unioned to the remaining nine days.
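A minimal sketch of that loop, just to make the shape concrete - the table name ("my_db.events"), partition column ("dt") and processWindow are placeholder names I am assuming, not the actual code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical sketch of the sliding-window loop described above;
// table, column and function names are assumed placeholders.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

def processWindow(df: DataFrame): DataFrame = df // stand-in for the complex operations

val dates: Seq[String] = Seq("2020-01-01", "2020-01-02" /* ... partition dates in order */)

// Seed the window with the first ten date partitions
var window: DataFrame = spark.table("my_db.events")
  .where(col("dt").isin(dates.take(10): _*))

for (i <- 10 until dates.length) {
  val processed = processWindow(window)

  // Write the newest day's results back into the Hive table
  processed.where(col("dt") === dates(i - 1))
    .write.mode("overwrite").insertInto("my_db.events")

  // Slide the window: drop the oldest day, union in the next partition.
  // Without breaking lineage, each iteration stacks another layer onto the DAG.
  window = processed.where(col("dt") =!= dates(i - 10))
    .union(spark.table("my_db.events").where(col("dt") === dates(i)))
}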

I realise that I need to break the Spark lineage to prevent the DAG from growing unmanageably large.

I believe I have two options:

  1. Checkpointing - involves a costly write to HDFS.
  2. Convert to RDD and back again:

spark.createDataset(myDS.rdd)
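For comparison, a rough sketch of what option 1 looks like (the checkpoint directory is an assumed placeholder):

// Option 1 sketch: a reliable checkpoint materializes the data to HDFS
// and drops the query lineage. The directory path is assumed.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val truncated = myDS.checkpoint() // eager by default, so it writes immediately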

Are there any disadvantages to using the second option? I am assuming it is an in-memory operation and therefore cheaper.

Answer

Checkpointing and converting back to RDD are indeed the best/only ways to truncate lineage.

Many (all?) of the Spark ML Dataset/DataFrame algorithms are actually implemented using RDDs internally; the exposed APIs are DS/DF, but they drop to RDDs underneath because the optimizer is not parallelized and because of the lineage size that iterative/recursive implementations build up.

There is a cost to converting to and from RDD, but it is smaller than the file-system checkpointing option.
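A hedged sketch of how the RDD round-trip is often wrapped up, assuming you cache and materialize first so that evaluating the rebuilt Dataset does not re-run the accumulated DAG:

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.storage.StorageLevel

// Sketch: cut the query plan by dropping to the underlying RDD and
// rebuilding the Dataset from it. Caching plus count() first is my
// assumption; without it, the old plan is recomputed on next evaluation.
def truncateLineage[T: Encoder](spark: SparkSession, ds: Dataset[T]): Dataset[T] = {
  ds.persist(StorageLevel.MEMORY_AND_DISK)
  ds.count() // force materialization before discarding the plan
  spark.createDataset(ds.rdd)
}

In a loop like the one in the question, you would call this on the window each iteration (or every few iterations), and unpersist the previous cache once the rebuilt Dataset has been materialized.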
