Spark: Can explicit caching interfere with the Catalyst optimizer's ability to optimize some queries?


Question

I'm studying to take the Databricks Spark certification exam, and their practice exam (see https://databricks-prod-cloudfront.cloud.databricks.com/public/793177bc53e528530b06c78a4fa0e086/0/6221173/100020/latest.html) requires us to accept this statement as true:

"Explicit caching can decrease application performance by interfering with the Catalyst optimizer's ability to optimize some queries"

I got this question wrong even though I have read up a lot on Catalyst and have a pretty good grasp of the details. So I wanted to shore up my knowledge of this topic and go to a source that explains the how and why behind this assertion.

Can anyone provide guidance on this? Specifically, why is this so? And how do we ensure that when we cache our datasets we are not actually getting in the way of the optimizer and making things worse? Thanks!

Answer

How and why can a cache decrease performance?

Let's use a simple example to demonstrate:

// Some data (run in spark-shell; in a standalone app, add `import spark.implicits._` for the 'id column syntax)
val df = spark.range(100)

df.join(df, Seq("id")).filter('id < 20).explain(true)

Here, the Catalyst plan will optimize this join by applying the filter to each DataFrame before joining, to reduce the amount of data that gets shuffled.

== Optimized Logical Plan ==
Project [id#0L]
+- Join Inner, (id#0L = id#69L)
   :- Filter (id#0L < 20)
   :  +- Range (0, 100, step=1, splits=Some(4))
   +- Filter (id#69L < 20)
      +- Range (0, 100, step=1, splits=Some(4))

If we cache the query after the join, the query won't be as optimized, as we can see here:

df.join(df, Seq("id")).cache.filter('id <20).explain(true)

== Optimized Logical Plan ==
Filter (id#0L < 20)
+- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *Project [id#0L]
         +- *BroadcastHashJoin [id#0L], [id#74L], Inner, BuildRight
            :- *Range (0, 100, step=1, splits=4)
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
               +- *Range (0, 100, step=1, splits=4)

The filter is done at the very end ...

Why so? Because `cache` materializes the DataFrame (for DataFrames the default storage level is MEMORY_AND_DISK, so it is stored in memory and spilled to disk as needed). Every subsequent query will use this cached DataFrame, so Catalyst will only optimize the part of the query AFTER the cache. We can check that with the same example!

df.join(df, Seq("id")).cache.join(df, Seq("id")).filter('id <20).explain(true)

== Optimized Logical Plan ==
Project [id#0L]
+- Join Inner, (id#0L = id#92L)
   :- Filter (id#0L < 20)
   :  +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :        +- *Project [id#0L]
   :           +- *BroadcastHashJoin [id#0L], [id#74L], Inner, BuildRight
   :              :- *Range (0, 100, step=1, splits=4)
   :              +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :                 +- *Range (0, 100, step=1, splits=4)
   +- Filter (id#92L < 20)
      +- Range (0, 100, step=1, splits=Some(4))

The filter is applied before the second join, but after the first one, because the first join's result is cached.
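
One way to avoid getting in the way of the optimizer, sketched below under the assumption that the filter is known before we cache, is to reorder the query so the filter runs before the cache; the data that gets materialized then already benefits from the pushdown:

// A sketch (not from the original answer), same df as above.
// Filtering before the cache lets Catalyst push 'id < 20 into both sides
// of the join, so only the already-reduced result is materialized.
df.join(df, Seq("id")).filter('id < 20).cache.explain(true)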

More generally: know what you do! You can simply compare the Catalyst plans and see which optimizations Spark is missing.
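
For instance, here is a minimal sketch of such a comparison (same df as above; Dataset.storageLevel and unpersist are standard Spark APIs):

// Build the cached and uncached variants of the same query.
val cachedJoin = df.join(df, Seq("id")).cache
val withCache = cachedJoin.filter('id < 20)
val withoutCache = df.join(df, Seq("id")).filter('id < 20)

withCache.explain(true)    // the filter sits above an InMemoryRelation
withoutCache.explain(true) // the filter is pushed below the join

// The default storage level for Dataset.cache is MEMORY_AND_DISK.
println(cachedJoin.storageLevel)

// Release the cached data once the comparison is done.
cachedJoin.unpersist()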

