What is the difference between Spark checkpoint and persist to a disk

Problem description

What is the difference between Spark checkpoint and persist to disk? Are both of these stored on the local disk?

Recommended answer

There are a few important differences, but the fundamental one is what happens to the lineage. persist / cache keeps the lineage intact, while checkpoint breaks it. Let's consider the following examples:

import org.apache.spark.storage.StorageLevel

// A small pair RDD whose lineage includes a shuffle (reduceByKey).
val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)


  • cache / persist

    val indCache  = rdd.mapValues(_ > 4)
    indCache.persist(StorageLevel.DISK_ONLY)
    
    indCache.toDebugString
    // (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
    
    indCache.count
    // 3
    
    indCache.toDebugString
    // (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
    //  |       CachedPartitions: 8; MemorySize: 0.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 587.0 B
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]
    


  • checkpoint

    val indChk  = rdd.mapValues(_ > 4)
    indChk.checkpoint
    
    indChk.toDebugString
    // (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 []
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 []
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 []
    
    indChk.count
    // 3
    
    indChk.toDebugString
    // (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
    //  |  ReliableCheckpointRDD[12] at count at <console>:27 []
    


  • As you can see, in the first case the lineage is preserved even when data is fetched from the cache. This means the data can be recomputed from scratch if some partitions of indCache are lost. In the second case the lineage is completely lost after the checkpoint, and indChk no longer carries the information required to rebuild it.
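
    One way to observe the broken lineage directly is to probe indChk after the count above. A spark-shell sketch: isCheckpointed, getCheckpointFile and dependencies are standard RDD methods, but the ids and paths shown are illustrative and will vary per session:

    // indChk was materialized by the count above, so it is now backed by
    // the checkpoint files rather than its original shuffle lineage.
    indChk.isCheckpointed
    // Boolean = true

    indChk.getCheckpointFile
    // Option[String] = Some(file:/tmp/.../rdd-11)   (path varies)

    indChk.dependencies.map(_.rdd).foreach(println)
    // ReliableCheckpointRDD[12] at count at <console>:27   (ids vary)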

    checkpoint, unlike cache / persist, is computed separately from other jobs. That's why an RDD marked for checkpointing should be cached:

    It is strongly recommended that this RDD is persisted in memory, otherwise saving it to a file will require recomputation.
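
    For example, a minimal sketch of the recommended cache-then-checkpoint pattern (assuming a spark-shell session; /tmp/spark-checkpoints is a placeholder path):

    sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder; use a DFS path on a cluster

    val indChk2 = rdd.mapValues(_ > 4)
    // Cache first so the separate checkpoint job can read the already
    // computed partitions instead of recomputing the whole lineage.
    indChk2.cache()
    indChk2.checkpoint()

    // The first action runs the normal job; the checkpoint job that
    // follows it reuses the cached partitions when writing them out.
    indChk2.count()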

    Finally, checkpointed data is persistent and is not removed after the SparkContext is destroyed.
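
    A quick way to check this in local mode, as a rough sketch: stop the context and list the checkpoint directory, which still holds the serialized partitions (path assumed from the sketch above):

    import java.io.File

    sc.stop()

    // The checkpoint files remain on disk even though the SparkContext
    // that wrote them is gone; a later application can still read them.
    new File("/tmp/spark-checkpoints")
      .listFiles()
      .foreach(println)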

    Regarding data storage, SparkContext.setCheckpointDir, which is used by RDD.checkpoint, requires a DFS path when running in non-local mode; otherwise it can be a local file system as well. localCheckpoint and persist without replication should use the local file system.
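
    To make those storage options concrete, a sketch contrasting the two, picking up in a fresh spark-shell session (the hdfs:// URI and paths are placeholders):

    // Non-local mode: the checkpoint dir must be on a DFS visible to all
    // executors, e.g. HDFS; in local mode a plain local path works too.
    sc.setCheckpointDir("hdfs:///user/spark/checkpoints")  // placeholder path

    val reliable = rdd.mapValues(_ > 4)
    reliable.checkpoint()     // written to the checkpoint dir, fault tolerant

    // localCheckpoint trades reliability for speed: the data stays in the
    // executors' local storage and is lost if an executor dies.
    val local = rdd.mapValues(_ > 4)
    local.localCheckpoint()

    reliable.count()
    local.count()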
