Why do I have to explicitly tell Spark what to cache?

Question

In Spark, each time we perform an action on an RDD, the RDD is re-computed. So if we know that an RDD is going to be reused, we should cache it explicitly.
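For reference, a minimal Scala sketch of this explicit caching, run in local mode; the object name and the toy map/filter lineage are illustrative assumptions, not code from the original question:

    import org.apache.spark.sql.SparkSession

    object ExplicitCacheSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("explicit-cache-sketch")
          .master("local[*]") // assumption: local mode, purely for illustration
          .getOrCreate()
        val sc = spark.sparkContext

        // An RDD with a non-trivial lineage: derive and filter some numbers.
        val derived = sc.parallelize(1 to 1000000)
          .map(i => i * 2L)
          .filter(_ % 3 == 0)

        // Without cache(), each action below would re-run the map/filter lineage.
        derived.cache() // mark for in-memory storage; materialized on the first action

        println(derived.count())       // first action: computes the lineage and caches the partitions
        println(derived.reduce(_ + _)) // second action: served from the cached blocks

        spark.stop()
      }
    }

Without the cache() call, the second action would trigger the parallelize/map/filter chain all over again.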

Let's say Spark decided to cache all RDDs lazily and used LRU to automatically keep the most relevant RDDs in memory (which is how most caching works anyway). That would be of great help to developers, since they would not have to think about caching and could concentrate on the application. Also, I do not see how it could negatively impact performance: it is difficult to keep track of how many times a variable (an RDD) is used inside a program, so most programmers will decide to cache most of the RDDs anyway.

Caching usually happens automatically; take an OS/platform, a framework, or a tool as examples. But given the complexities of caching in distributed computing, I might be missing why caching cannot be automatic, or what the performance implications would be.

So I fail to understand why I have to cache explicitly, since:

  1. It looks ugly
  2. It can easily be missed
  3. It can easily be over- or under-used

Solution

A subjective list of reasons:

  • In practice caching is rarely needed and is useful mostly for iterative algorithms and for breaking long lineages (see the sketch after this list). For example, typical ETL pipelines may not require caching at all. Caching most of the RDDs is definitely not the right choice.
  • There is no universal caching strategy. The actual choice depends on available resources like the amount of memory, disks (local, remote, storage service), file systems (in-memory, on-disk), and the particular application.
  • On-disk persistence is expensive, while in-memory persistence puts more stress on the JVM and consumes the most valuable resource in Spark.
  • It is impossible to cache automatically without making assumptions about application semantics. In particular:

    • the expected behavior when the data source changes. There is no universal answer, and in many situations it can be impossible to automatically track changes
    • differentiating between deterministic and non-deterministic transformations and choosing between caching and re-computing
  • Comparing Spark caching to OS-level caching doesn't make sense. The main goal of OS caching is to reduce latency. In Spark, latency is usually not the most important factor, and caching is used for other purposes like consistency, correctness, and reducing stress on different parts of the system.
  • If the cache doesn't use off-heap storage, then caching introduces additional pressure on the garbage collector. The GC cost can actually be higher than the cost of recomputing the data.
  • Depending on the data and the caching method, reading data from the cache can be significantly less efficient memory-wise.
  • Caching interferes with more advanced optimizations available in Spark SQL, effectively disabling partition pruning or predicate and projection pushdown.
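To make the first two bullets concrete (iterative algorithms, lineage growth, and the application-specific choice of storage level), here is a Scala sketch; the step parameter and the MEMORY_AND_DISK level are assumptions chosen for the example, not prescriptions from the answer:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Sketch of the iterative case: persist the working RDD on each iteration so the
    // lineage does not grow without bound, and unpersist the previous copy once the
    // new one has been materialized. `step` stands in for whatever transformation
    // the algorithm applies per iteration.
    def iterate(input: RDD[Double], iterations: Int)(step: RDD[Double] => RDD[Double]): RDD[Double] = {
      var current: RDD[Double] = input.persist(StorageLevel.MEMORY_AND_DISK) // explicit storage-level choice
      current.count() // materialize before the loop

      for (_ <- 1 to iterations) {
        val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
        next.count()        // force materialization so the previous copy is no longer needed
        current.unpersist() // release the previous iteration's blocks
        current = next
      }
      current
    }

Whether MEMORY_AND_DISK, MEMORY_ONLY, or off-heap storage is appropriate depends on the resources listed above, which is exactly why no single automatic policy fits every application.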

It is also worth noting that:

  • removal of cached data is handled automatically using LRU (a small sketch of the related calls follows this list)
  • some data (like intermediate shuffle data) is persisted automatically. I acknowledge that this makes some of the previous arguments at least partially invalid.
  • Spark caching doesn't affect system-level or JVM-level mechanisms
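For completeness, a short spark-shell style sketch of the cache-management calls touched on in these notes; the RDD contents are placeholders, and the comments restate the answer's points rather than guaranteed behavior in every configuration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-mgmt-sketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100000).map(_.toString)

    rdd.persist(StorageLevel.MEMORY_ONLY) // request in-memory caching
    rdd.count()                           // materialize; under memory pressure blocks can still be evicted (LRU)

    println(rdd.getStorageLevel)          // inspect the requested storage level

    rdd.unpersist()                       // explicit release; otherwise eviction is left to the LRU policy
    spark.stop()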
