Spark iterative/recursive algorithms - Breaking spark lineage


Question

I have a recursive Spark algorithm that applies a sliding window of 10 days to a Dataset.

The original dataset is loaded from a Hive table partitioned by date.

At each iteration, a complex set of operations is applied to the Dataset containing the ten-day window.

The last date is then inserted back into the original Hive table, and the next date is loaded from Hive and unioned to the remaining nine days.
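A minimal sketch of that loop, just to make the shape concrete - the table name ("my_db.events"), partition column ("dt") and processWindow are placeholder names I am assuming, not the actual code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical sketch of the sliding-window loop described above;
// table, column and function names are assumed placeholders.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

def processWindow(df: DataFrame): DataFrame = df // stand-in for the complex operations

val dates: Seq[String] = Seq("2020-01-01", "2020-01-02" /* ... partition dates in order */)

// Seed the window with the first ten date partitions
var window: DataFrame = spark.table("my_db.events")
  .where(col("dt").isin(dates.take(10): _*))

for (i <- 10 until dates.length) {
  val processed = processWindow(window)

  // Write the newest day's results back into the Hive table
  processed.where(col("dt") === dates(i - 1))
    .write.mode("overwrite").insertInto("my_db.events")

  // Slide the window: drop the oldest day, union in the next partition.
  // Without breaking lineage, each iteration stacks another layer onto the DAG.
  window = processed.where(col("dt") =!= dates(i - 10))
    .union(spark.table("my_db.events").where(col("dt") === dates(i)))
}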

I realise that I need to break the Spark lineage to prevent the DAG from growing unmanageably large.

I believe I have two options:

  1. Checkpointing - involves a costly write to HDFS.
  2. Convert to RDD and back again:

spark.createDataset(myDS.rdd)
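For comparison, a rough sketch of what option 1 looks like (the checkpoint directory is an assumed placeholder):

// Option 1 sketch: a reliable checkpoint materializes the data to HDFS
// and drops the query lineage. The directory path is assumed.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val truncated = myDS.checkpoint() // eager by default, so it writes immediately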

Are there any disadvantages to using the second option? I am assuming it is an in-memory operation and therefore cheaper.

Answer

Checkpointing and converting back to RDD are indeed the best/only ways to truncate lineage.

Many (all?) of the Spark ML Dataset/DataFrame algorithms are actually implemented using RDDs internally; the exposed APIs are DS/DF, but they drop to RDDs underneath because the optimizer is not parallelized and because of the lineage size that iterative/recursive implementations build up.

There is a cost to converting to and from RDD, but it is smaller than the file-system checkpointing option.
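A hedged sketch of how the RDD round-trip is often wrapped up, assuming you cache and materialize first so that evaluating the rebuilt Dataset does not re-run the accumulated DAG:

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.storage.StorageLevel

// Sketch: cut the query plan by dropping to the underlying RDD and
// rebuilding the Dataset from it. Caching plus count() first is my
// assumption; without it, the old plan is recomputed on next evaluation.
def truncateLineage[T: Encoder](spark: SparkSession, ds: Dataset[T]): Dataset[T] = {
  ds.persist(StorageLevel.MEMORY_AND_DISK)
  ds.count() // force materialization before discarding the plan
  spark.createDataset(ds.rdd)
}

In a loop like the one in the question, you would call this on the window each iteration (or every few iterations), and unpersist the previous cache once the rebuilt Dataset has been materialized.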
