Why do I have to explicitly tell Spark what to cache?

Question

In Spark, each time we run an action on an RDD, the RDD is recomputed. So if we know that the RDD is going to be reused, we should cache it explicitly.
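
The recomputation described above can be sketched with a toy model in plain Python. This is not the Spark API: `LazyDataset` and its methods are hypothetical stand-ins that only mimic the idea of lazy transformations, actions, and an explicit `cache()` call (in PySpark the real calls would be `rdd.cache()` or `rdd.persist()`).

```python
# Toy model of lazy evaluation in plain Python; this is NOT the Spark API.
# LazyDataset, map, cache, collect and count are hypothetical stand-ins.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute      # zero-argument function that rebuilds the data
        self._cached = None          # materialized data, filled in only after cache()
        self._use_cache = False

    def map(self, fn):
        # Transformations are lazy: they only record how to derive the new data.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Mark for caching; the data is stored the first time an action runs.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def collect(self):               # action
        return list(self._materialize())

    def count(self):                 # action
        return len(self._materialize())


calls = {"n": 0}

def expensive(x):
    calls["n"] += 1                  # count how often the transformation runs
    return x * x

source = LazyDataset(lambda: [1, 2, 3])

uncached = source.map(expensive)
uncached.collect()
uncached.count()
calls_without_cache = calls["n"]     # 6: the map ran once per action

calls["n"] = 0
cached = source.map(expensive).cache()
cached.collect()
cached.count()
calls_with_cache = calls["n"]        # 3: the second action reused the stored data
```

In this sketch the second action on the uncached dataset pays for the whole lineage again, while the cached one only pays once.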

Let's say Spark decided to lazily cache all RDDs and used LRU to automatically keep the most relevant ones in memory (which is how most caching works anyway). That would be a great help to developers, who would no longer have to think about caching and could concentrate on the application. I also do not see how it could hurt performance: since it is difficult to keep track of how many times a variable (RDD) is used inside a program, most programmers will decide to cache most of their RDDs anyway.

Caching usually happens automatically; take an OS/platform, a framework, or a tool as examples. But given the complexities of caching in distributed computing, I might be missing why caching cannot be automatic, or what the performance implications are.

So I fail to understand why I have to cache explicitly, because:

  1. It looks ugly
  2. It can easily be missed
  3. It can easily be overused or underused

Answer

A subjective list of reasons:

  • in practice caching is rarely needed and is useful mostly for iterative algorithms and for breaking long lineages. For example, typical ETL pipelines may not require caching at all. Caching most of the RDDs is definitely not the right choice.
  • there is no universal caching strategy. The actual choice depends on available resources such as the amount of memory, disks (local, remote, storage service), file system (in-memory, on-disk), and on the particular application.
  • on-disk persistence is expensive, while in-memory persistence puts more stress on the JVM and uses the most valuable resource in Spark
  • it is impossible to cache automatically without making assumptions about application semantics. In particular:

    • expected behavior when the data source changes. There is no universal answer, and in many situations it can be impossible to track changes automatically
    • differentiating between deterministic and non-deterministic transformations and choosing between caching and re-computing
  • comparing Spark caching to OS-level caching doesn't make sense. The main goal of OS caching is to reduce latency. In Spark, latency is usually not the most important factor, and caching is used for other purposes such as consistency, correctness, and reducing stress on different parts of the system.
  • if the cache doesn't use off-heap storage, then caching puts additional pressure on the garbage collector. The GC cost can actually be higher than the cost of recomputing the data.
  • depending on the data and the caching method, reading data from the cache can be significantly less efficient memory-wise.
  • Caching interferes with more advanced optimizations available in Spark SQL, effectively disabling partition pruning or predicate and projection pushdown.
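
The point about application semantics can be illustrated with a small sketch. It is a toy model in plain Python, not Spark code, and every name in it (`source_rows`, `read_source`, and so on) is hypothetical. It assumes a data source that changes between two actions and shows that recomputing and caching give different answers, neither of which is obviously the "right" one.

```python
# Toy model in plain Python, not the Spark API; all names are hypothetical.

source_rows = [1, 2, 3]              # stands in for an external, mutable data source

def read_source():
    # Every call re-reads the source, like recomputing an uncached lineage.
    return list(source_rows)

# Without caching: each action re-reads the source.
total = sum(read_source())           # first action sees [1, 2, 3]
source_rows.append(4)                # the source changes between the two actions
rows = len(read_source())            # second action sees [1, 2, 3, 4]
mean_recomputed = total / rows       # 6 / 4 = 1.5, mixes two versions of the data

# With caching: both actions use the snapshot taken before the change.
source_rows[:] = [1, 2, 3]
snapshot = read_source()
total = sum(snapshot)
source_rows.append(4)                # same change, but the snapshot is unaffected
rows = len(snapshot)
mean_cached = total / rows           # 6 / 3 = 2.0, consistent but stale
```

Whether the application wants the fresh-but-mixed result or the consistent-but-stale one is a semantic decision, which is why an automatic policy would have to guess.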

It is also worth noting that:

  • removing cached data is handled automatically using LRU
  • some data (like intermediate shuffle data) is persisted automatically. I acknowledge that it makes some of the previous arguments at least partially invalid.
  • Spark caching doesn't affect system level or JVM level mechanisms
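
LRU eviction on its own does not make a cache-everything policy safe. The sketch below is a toy LRU cache in plain Python (not Spark's block manager; `LRUCache`, `ACCESSES`, and `run` are hypothetical names) with room for two entries. It assumes one dataset that is reused and several one-off intermediates, and compares storing everything with storing only the reused dataset.

```python
from collections import OrderedDict

# Toy LRU cache in plain Python; NOT Spark's storage layer.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry comes first

    def get(self, key, compute, store=True):
        if key in self.entries:
            self.entries.move_to_end(key)          # mark as most recently used
            return self.entries[key]
        value = compute()                          # cache miss: (re)compute
        if store:
            self.entries[key] = value
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used
        return value


# One dataset is reused; the others are touched once and never again.
ACCESSES = ["reused", "temp1", "temp2", "reused", "temp3", "temp4", "reused"]

def run(cache_everything):
    cache = LRUCache(capacity=2)
    computed = []                      # log of every (re)computation

    def make(name):
        def compute():
            computed.append(name)
            return name.upper()
        return compute

    for name in ACCESSES:
        store = cache_everything or name == "reused"
        cache.get(name, make(name), store=store)
    return computed.count("reused")

recomputed_auto = run(cache_everything=True)       # 3: one-off data keeps evicting it
recomputed_explicit = run(cache_everything=False)  # 1: only the reused data is stored
```

Under these assumptions the one-off intermediates push the reused dataset out before it is needed again, so caching everything ends up recomputing it on every use.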
