What is the difference between spark checkpoint and persist to a disk


Problem description

What is the difference between Spark checkpoint and persist to a disk? Are both of these stored on the local disk?

Recommended answer

There are a few important differences, but the fundamental one is what happens with lineage. Persist/cache keeps the lineage intact, while checkpoint breaks it. Let's consider the following examples:

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)

  • cache/persist:

    val indCache  = rdd.mapValues(_ > 4)
    indCache.persist(StorageLevel.DISK_ONLY)
    
    indCache.toDebugString
    // (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
    
    indCache.count
    // 3
    
    indCache.toDebugString
    // (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
    //  |       CachedPartitions: 8; MemorySize: 0.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 587.0 B
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
    

  • checkpoint:

    // requires sc.setCheckpointDir(...) to have been set beforehand
    val indChk  = rdd.mapValues(_ > 4)
    indChk.checkpoint
    
    indChk.toDebugString
    // (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 []
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 []
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 []
    
    indChk.count
    // 3
    
    indChk.toDebugString
    // (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
    //  |  ReliableCheckpointRDD[12] at count at <console>:27 []
    

  • As you can see, in the first case the lineage is preserved even if data is fetched from the cache. It means that the data can be recomputed from scratch if some partitions of indCache are lost. In the second case the lineage is completely lost after the checkpoint, and indChk no longer carries the information required to rebuild it.

    checkpoint, unlike cache/persist, is computed separately from other jobs. That's why an RDD marked for checkpointing should be cached, as the Spark documentation notes:

    It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
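
    Following that recommendation, a minimal sketch (assuming an existing SparkContext `sc`, the `rdd` defined above, and a checkpoint directory already configured) persists the RDD before marking it for checkpointing, so the separate checkpoint job reads cached partitions instead of recomputing the whole lineage:

    ```scala
    import org.apache.spark.storage.StorageLevel

    // Assumes `sc` is a live SparkContext and
    // sc.setCheckpointDir(...) has already been called.
    val indChk = rdd.mapValues(_ > 4)

    // Cache first: the checkpoint job triggered after the first
    // action reads these cached partitions rather than recomputing.
    indChk.persist(StorageLevel.MEMORY_ONLY)
    indChk.checkpoint()

    // The first action runs the job, then an extra job writes
    // the checkpoint files to the configured directory.
    indChk.count
    ```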

    Finally, checkpointed data is persistent and is not removed after the SparkContext is destroyed.

    Regarding data storage: SparkContext.setCheckpointDir, which is used by RDD.checkpoint, requires a DFS path if running in non-local mode; otherwise it can be a local file system as well. localCheckpoint, and persist without replication, use the local file system.
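
    As a sketch of the two storage setups (the paths below are illustrative): on a cluster, a reliable checkpoint needs a directory on a distributed file system, while localCheckpoint keeps the data in the executors' local storage, trading reliability for speed:

    ```scala
    // Reliable checkpoint: requires a DFS path in non-local mode.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // illustrative path
    rdd.checkpoint()

    // Local checkpoint: stored in executor-local block storage.
    // Faster (no DFS write), but lost if an executor dies,
    // and no checkpoint directory is needed.
    rdd.localCheckpoint()
    ```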

    Important note:

    RDD checkpointing is a different concept from checkpointing in Spark Streaming. The former is designed to address the lineage issue; the latter is all about streaming reliability and failure recovery.

