Spark iterative/recursive algorithms - Breaking spark lineage

Question

I have a recursive Spark algorithm that applies a sliding window of 10 days to a Dataset.

The original dataset is loaded from a Hive table partitioned by date.

At each iteration a complex set of operations is applied to the Dataset containing the ten-day window.

The last date is then inserted back into the original Hive table, and the next date is loaded from Hive and unioned with the remaining nine days.
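
For concreteness, a minimal sketch of that loop in Scala, assuming a Hive table named "events" partitioned by a date-string column "ds"; processWindow is a stand-in for the complex set of operations (all names here are illustrative, not from the question):

    import java.time.LocalDate
    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    // Stand-in for the complex per-window computation.
    def processWindow(df: DataFrame): DataFrame = df

    val allDates: Seq[String] =
      (0 until 31).map(d => LocalDate.of(2019, 1, 1).plusDays(d).toString)

    // Seed the first ten-day window from the partitioned Hive table.
    var window: DataFrame =
      spark.table("events").where($"ds".isin(allDates.take(10): _*))

    for (i <- 0 until allDates.length - 10) {
      val processed = processWindow(window)

      // Insert the most recent day back into the original Hive table.
      processed.where($"ds" === allDates(i + 9))
        .write.mode("append").insertInto("events")

      // Slide the window: drop the oldest day, union the next day from Hive.
      window = processed.where($"ds" =!= allDates(i))
        .union(spark.table("events").where($"ds" === allDates(i + 10)))
      // Without truncation, window's plan now embeds every prior iteration.
    }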

I realise that I need to break the Spark lineage to prevent the DAG from growing unmanageably.

I believe I have two options:

  1. Checkpointing - involves a costly write to HDFS.
  2. Convert to RDD and back again

spark.createDataset(myDS.rdd)
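
For reference, a sketch of what that round-trip needs in practice (myDS, myDF, and MyRow are illustrative names, not from the question): the typed form requires an implicit Encoder in scope, and the equivalent for an untyped DataFrame goes through createDataFrame with the original schema.

    // Typed Dataset: an implicit Encoder must be in scope, e.g. via
    // spark.implicits._ for a case class (MyRow is illustrative).
    import spark.implicits._
    val truncated = spark.createDataset(myDS.rdd) // Dataset[MyRow]

    // Untyped DataFrame equivalent: rebuild from the RDD[Row] plus the
    // original schema.
    val truncatedDF = spark.createDataFrame(myDF.rdd, myDF.schema)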

Are there any disadvantages to using the second option? I am assuming this is an in-memory operation and is therefore cheaper.

Answer

Checkpointing and converting back to RDD are indeed the best/only ways to truncate lineage.

Many (all?) of the Spark ML Dataset/DataFrame algorithms are actually implemented using RDDs; the exposed APIs are DS/DF, but internally they fall back to RDDs because the optimizer is not parallelized and because of the lineage size that iterative/recursive implementations produce.

There is a cost to converting to and from an RDD, but it is smaller than that of the file-system checkpointing option.
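
As a sketch, the two truncation options look something like this inside the loop (the checkpoint directory and the cache() call are illustrative choices, not prescribed by the answer):

    // Option 1: reliable checkpoint -- materializes the data to HDFS and
    // returns a Dataset whose plan no longer references prior iterations.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
    window = window.checkpoint() // eager by default

    // Option 2: round-trip through an RDD -- the new DataFrame's plan is a
    // bare scan of the RDD, truncating the accumulated logical plan.
    window = spark.createDataFrame(window.rdd, window.schema)
    // Often paired with cache(), otherwise each action re-runs the old plan.
    window.cache()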
