What is the difference between spark checkpoint and persist to a disk
Question
What is the difference between spark checkpoint and persist to a disk? Are both of these stored on the local disk?
Answer
There are a few important differences, but the fundamental one is what happens with lineage. persist / cache keeps the lineage intact, while checkpoint breaks the lineage. Let's consider the following examples:
import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)
cache / persist:

val indCache = rdd.mapValues(_ > 4)
indCache.persist(StorageLevel.DISK_ONLY)

indCache.toDebugString
// (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
// |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
// +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
//    |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]

indCache.count
// 3

indCache.toDebugString
// (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
// |       CachedPartitions: 8; MemorySize: 0.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 587.0 B
// |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
// +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
//    |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
checkpoint:

val indChk = rdd.mapValues(_ > 4)
indChk.checkpoint

indChk.toDebugString
// (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
// |  ShuffledRDD[3] at reduceByKey at <console>:21 []
// +-(8) MapPartitionsRDD[2] at map at <console>:21 []
//    |  ParallelCollectionRDD[1] at parallelize at <console>:21 []

indChk.count
// 3

indChk.toDebugString
// (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
// |  ReliableCheckpointRDD[12] at count at <console>:27 []
As you can see, in the first case the lineage is preserved even when the data is fetched from the cache. This means the data can be recomputed from scratch if some partitions of indCache are lost. In the second case the lineage is completely lost after the checkpoint, and indChk no longer carries the information required to rebuild it.
checkpoint, unlike cache / persist, is computed separately from other jobs. That's why an RDD marked for checkpointing should be cached:
It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
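That recommendation can be sketched as the following pattern, assuming an existing SparkContext with a checkpoint directory already set (the variable name toCheckpoint is illustrative, not from the original answer):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: cache the RDD before checkpointing, so the action that triggers
// the checkpoint write can reuse the cached partitions instead of
// recomputing the whole lineage a second time for the checkpoint job.
val toCheckpoint = rdd.mapValues(_ > 4)
toCheckpoint.persist(StorageLevel.MEMORY_ONLY) // keep computed partitions around
toCheckpoint.checkpoint()                      // mark for reliable checkpointing
toCheckpoint.count()                           // first action: computes, caches, writes checkpoint
```

Without the persist call, the first action would compute the RDD once for the result and once more for the separate checkpoint job.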
Finally, checkpointed data is persistent and is not removed after the SparkContext is destroyed.
Regarding data storage: SparkContext.setCheckpointDir, used by RDD.checkpoint, requires a DFS path if running in non-local mode. Otherwise it can be a local file system as well. localCheckpoint and persist without replication should use the local file system.
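For completeness, here is a minimal sketch of how this wiring looks end to end; the app name, master, and directory path are illustrative assumptions (in non-local mode the path would be a DFS location such as an hdfs:// URI):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: local mode, so a plain local directory is acceptable
// as the checkpoint directory. In cluster mode use a DFS path instead.
val sc = new SparkContext(
  new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/checkpoints") // must be set before checkpointing

val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)
rdd.checkpoint() // reliable checkpoint: written under /tmp/checkpoints
rdd.count()      // action triggers the computation and the checkpoint write
```

The files written under the checkpoint directory outlive the SparkContext, which is what makes checkpointed data persistent across applications.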