Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk


Problem description

Checkpoint version:

val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
// checkpoint() is eager by default and returns a new DataFrame; the returned value is the one whose lineage is truncated
val checkpointedDf = df.checkpoint()

Write-to-disk version:

df.write.parquet(savePath)
val df2 = spark.read.parquet(savePath)   // read back so the reloaded DataFrame carries no lineage to the original computation

I think both break the lineage in the same way.

In my experiments the checkpoint is almost 30 times bigger on disk than the parquet output (689 GB vs. 24 GB). In terms of running time, the checkpoint takes about 1.5 times longer (10.5 min vs. 7.5 min).

Considering all this, what would be the point of using checkpoint instead of saving to a file? Am I missing something?

Recommended answer

Checkpointing is the process of truncating an RDD's lineage graph and saving it to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be used for some other purpose.

When you checkpoint, the RDD is serialized and stored on disk. It is not stored in parquet format, so the data is not storage-optimized on disk. Parquet, by contrast, provides various compression and encoding schemes to store the data in an optimized way. This explains the difference in size.
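For example, the parquet writer lets you choose the compression codec explicitly (a sketch; the compression option and its values are standard Spark ones, with snappy being the default in recent versions, and the output path is illustrative):

df.write.option("compression", "gzip").parquet("/some/path/parquet_gzip")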

  • You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users competing for resources and there are not enough resources to run all the jobs simultaneously.

You must think about checkpointing if your computations are really expensive and take a long time to finish, because it could be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch.

And there is a slight inconvenience prior to the Spark 2.1 release: there is no way to checkpoint a DataFrame, so you have to checkpoint the underlying RDD. This issue has been resolved in Spark 2.1 and later versions.
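A sketch of that pre-2.1 workaround, i.e. checkpointing the underlying RDD and rebuilding the DataFrame from it (variable names and the directory path are illustrative):

spark.sparkContext.setCheckpointDir("/some/checkpoint/dir")
val rowRdd = df.rdd               // Dataset[Row] -> RDD[Row]
rowRdd.checkpoint()               // RDD checkpointing is lazy: this only marks the RDD
rowRdd.count()                    // an action triggers the computation and the checkpoint write
val restored = spark.createDataFrame(rowRdd, df.schema)   // same schema, truncated lineage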

The problems with saving to disk as parquet and reading it back are that:

  • The coding can be inconvenient: you need to save and read it back every time (a small helper like the sketch after this list can cut down the repetition).
  • It can slow down the overall performance of the job, because when you save as parquet and read it back, the DataFrame needs to be reconstructed again.
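If you do go the parquet route, the save-and-reload pattern from the question can be wrapped in a small helper to reduce the repetitive code mentioned in the first point (a sketch; saveAndReload is a hypothetical name, not a Spark API):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: writes the DataFrame to parquet and reads it back,
// breaking the lineage much like a checkpoint but with parquet's storage efficiency
def saveAndReload(df: DataFrame, path: String)(implicit spark: SparkSession): DataFrame = {
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path)
}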

This wiki could be useful for further investigation.

As stated in the Dataset Checkpointing wiki:

Checkpointing is actually a feature of Spark Core (which Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with the previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming, the now-obsolete Spark module for stream processing based on the RDD API.

Checkpointing truncates the lineage of the RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.

Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of the Dataset being checkpointed.
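One way to observe this truncation is to compare the query plans before and after checkpointing (a sketch, assuming df is any DataFrame built from a few transformations):

df.explain(true)           // the extended plan shows the full chain of transformations
val cp = df.checkpoint()   // eager by default: materializes the data and truncates the lineage
cp.explain(true)           // the plan now starts from the materialized RDD (Scan ExistingRDD / LogicalRDD)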

