In Apache Spark, can I incrementally cache an RDD partition?

Problem description

I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism will only cache that part, and the rest will be computed on demand.

Unfortunately, the following experiment seems to indicate otherwise:

      import org.apache.spark.util.LongAccumulator

      // TestSC is an existing SparkContext provided by the test suite
      val acc = new LongAccumulator()
      TestSC.register(acc)

      // every element that actually gets computed bumps the accumulator
      val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
        acc.add(1)
        v
      }

      rdd.persist()

      // take only the first 2 items of each of the 16 partitions
      val sliced = rdd
        .mapPartitions { itr =>
          itr.slice(0, 2)
        }

      sliced.count()

      assert(acc.value == 32)

Running it yields the following exception:

100 did not equal 32
ScalaTestFailureLocation: 
Expected :32
Actual   :100

It turns out the entire RDD was computed, instead of only the first 2 items in each partition. This is very inefficient in some cases (e.g. when you need to determine quickly whether the RDD is empty). Ideally, the caching manager should allow the caching buffer to be written incrementally and accessed randomly. Does this feature exist? If not, what should I do to make it happen? (Preferably using the existing memory & disk caching mechanism.)
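
For the quick-emptiness case specifically, a minimal sketch of a workaround (assuming the same TestSC context as above): isEmpty() is backed by take(1), which scans only as many partitions as it needs to find one element, so running the check before persist() avoids materializing every partition.

      // Sketch of a possible workaround for the quick-emptiness case, reusing
      // the same TestSC context as above: isEmpty() is backed by take(1), which
      // only scans as many partitions as needed to find one element.
      val candidate = TestSC.parallelize(1 to 100, 16).map(identity)

      if (!candidate.isEmpty()) {   // cheap check, no persist() in the lineage yet
        candidate.persist()         // opt into whole-partition caching only now
        candidate.count()           // this action computes and caches all partitions
      }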

Thank you very much for your input.

UPDATE 1 It appears that Spark already has 2 classes:

  • ExternalAppendOnlyMap
  • ExternalAppendOnlyUnsafeRowArray

that support more granular caching of many values. Even better, they don't rely on StorageLevel, but instead make their own decision about which storage device to use. However, I'm surprised that they are not options for RDD/Dataset caching directly, and are only used for co-group/join/streamOps or accumulators.

Recommended answer

Interesting in hindsight; here is my take:

  • You cannot cache incrementally. So the answer to your question is No.

persist applies to all partitions of that RDD; it is meant for multiple Actions, or for a single Action with several processing paths, starting from the same common RDD stage onwards.
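
A minimal sketch of that intended usage, assuming the same TestSC context as the question (expensiveTransform is just a hypothetical placeholder): the first action materializes and caches every partition, and later actions reuse the cached data.

      // Minimal sketch of persist()'s intended use; expensiveTransform is a
      // hypothetical stand-in for some costly per-element computation.
      def expensiveTransform(v: Int): Int = v * 2

      val base = TestSC.parallelize(1 to 100, 16).map(expensiveTransform)
      base.persist()               // marks every partition of base for caching

      val total  = base.count()    // first action computes and caches all partitions
      val sample = base.take(5)    // later actions read the cached partitions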

If you use persist, the RDD optimizer does not look at whether that call could be optimized away in the way you describe: you issued that call (method, API), so it executes it.

But if you do not use persist, the lazy evaluation and the fusing of code within a Stage seem to tie the slice cardinality and the accumulator together. That much is clear. Is it logical? Yes, because there is no further reference elsewhere as part of another Action. Others may see it as odd or erroneous, but imo it does not imply incremental persistence / caching.
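
A minimal sketch of that reading, reusing the question's setup but without persist(): the sliced iterator pulls only two elements per partition from the upstream map, so the accumulator should stop at 16 * 2 = 32.

      // Same setup as the question but without persist(): nothing forces the
      // full partitions to be materialized, so only 2 elements per partition
      // (16 * 2 = 32) should flow through the upstream map.
      val acc2 = new LongAccumulator()
      TestSC.register(acc2)

      val lazyRdd = TestSC.parallelize(1 to 100, 16).map { v =>
        acc2.add(1)
        v
      }

      lazyRdd
        .mapPartitions(_.slice(0, 2))   // no persist() anywhere in this lineage
        .count()

      assert(acc2.value == 32)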

So, imho, it is an interesting observation that I would not have come up with, but I am not convinced it proves anything about partial caching.
