Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Problem description

Checkpoint version:

val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
df.checkpoint()

Write-to-disk version:

df.write.parquet(savePath)
val df = spark.read.parquet(savePath)

I think both break the lineage in the same way.

In my experiments, the checkpoint is almost 30 times bigger on disk than the parquet output (689GB vs. 24GB). In terms of running time, checkpoint takes about 1.5 times longer (10.5 min vs. 7.5 min).

Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?

Recommended answer

Checkpointing is a process of truncating the RDD lineage graph and saving the data to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be used for some other purpose.
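
A minimal Scala sketch of that distinction, assuming a SparkSession named spark and a dataframe df as in the question (persist keeps the blocks on the executors but preserves the full lineage, while checkpoint materializes the data in the checkpoint directory and returns a dataframe whose lineage starts from those files):

import org.apache.spark.storage.StorageLevel

// persist: cache the data, but keep the full lineage so lost partitions
// can be recomputed from the original plan
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                      // materialize the cache

// checkpoint: write the data to the checkpoint directory; the returned
// dataframe's lineage starts from the checkpointed files
spark.sparkContext.setCheckpointDir("/some/path")
val checkpointed = df.checkpoint()  // eager by default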

When you checkpoint, the RDD is serialized and stored on disk. It is not stored in parquet format, so the data is not storage-optimized on disk, unlike parquet, which provides various compression and encoding schemes to store the data in an optimized way. This would explain the difference in size.
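
If you want to confirm the size difference on your own data, one rough check (just a sketch; the two paths below are illustrative, and it assumes HDFS-compatible storage) is to compare the total bytes under the checkpoint directory with the total bytes under the parquet output:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// total bytes written by the checkpoint vs. by the parquet save
val checkpointBytes = fs.getContentSummary(new Path("/some/checkpoint/path")).getLength
val parquetBytes    = fs.getContentSummary(new Path("/some/parquet/path")).getLength
println(s"checkpoint: $checkpointBytes bytes, parquet: $parquetBytes bytes")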

  • You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users competing for resources and there are not enough resources to run all the jobs simultaneously.

You must also think about checkpointing if your computations are really expensive and take a long time to finish, because it could be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch.
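
For example, a hedged sketch with made-up dataframe and column names: if an expensive join and aggregation feeds several downstream actions, checkpointing it once means it is computed a single time instead of once per action:

import org.apache.spark.sql.functions.{count, col}

spark.sparkContext.setCheckpointDir("/some/path")

// hypothetical expensive intermediate result
val perCountry = largeEvents.join(largeUsers, Seq("user_id"))
  .groupBy("country")
  .agg(count("*").as("events"))

val stable = perCountry.checkpoint()          // computed once, written to the checkpoint dir

// downstream actions reuse the checkpointed data instead of re-running the join
stable.filter(col("events") > 1000).show()
stable.write.parquet("/some/output/path")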

And there was a slight inconvenience prior to the Spark 2.1 release: there was no way to checkpoint a dataframe, so you had to checkpoint the underlying RDD. This was resolved in Spark 2.1 and later versions.
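
A sketch of the pre-2.1 workaround next to the 2.1+ one-liner (variable names are illustrative); the RDD checkpoint is lazy, so an action is needed to actually materialize it:

spark.sparkContext.setCheckpointDir("/some/path")

// Spark < 2.1: checkpoint the underlying RDD[Row] and rebuild the dataframe
val rows = df.rdd
rows.checkpoint()
rows.count()                                        // forces the checkpoint to be written
val dfFromCheckpoint = spark.createDataFrame(rows, df.schema)

// Spark >= 2.1: checkpoint the dataframe directly (eager by default)
val checkpointedDf = df.checkpoint()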

The problems with saving to disk in parquet and reading it back are that:

  • It can be inconvenient in your code, since you have to save and read back every time.
  • It can slow down the overall performance of the job, because when you save to parquet and read it back, the dataframe has to be rebuilt again (a sketch of this pattern follows below).
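
A minimal sketch of that pattern (the helper and the dataframe names are hypothetical): every manual lineage break costs one extra write plus one extra read, and the reloaded dataframe is rebuilt from the files:

import org.apache.spark.sql.DataFrame

// hypothetical helper: break the lineage by writing to parquet and reading it back
def saveAndReload(df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path)                 // rebuilt from the files on disk
}

val step1 = saveAndReload(expensiveStep1, "/tmp/step1")            // extra write + read
val step2 = saveAndReload(step1.join(otherDf, "id"), "/tmp/step2")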

This wiki could be useful for further investigation.

As described in the Dataset Checkpointing wiki:

Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.

Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.
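
In the DataFrame-based ML API this surfaces as a checkpoint interval on the estimator; a sketch (the column names and the ratings dataframe are assumptions), which only takes effect once a checkpoint directory has been set:

import org.apache.spark.ml.recommendation.ALS

spark.sparkContext.setCheckpointDir("/some/path")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setMaxIter(20)
  .setCheckpointInterval(5)   // checkpoint intermediate RDDs every 5 iterations

// val model = als.fit(ratings)   // 'ratings' would be your ratings dataframe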

Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.
