Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?


Question

I'm studying to take the Databricks Spark certification exam, and their practice exam (see https://databricks-prod-cloudfront.cloud.databricks.com/public/793177bc53e528530b06c78a4fa0e086/0/6221173/100020/latest.html) requires us to accept this statement as true:

"Explicit caching can decrease application performance by interfering with the Catalyst optimizer's ability to optimize some queries"

I got this question wrong even though I have read up a lot on Catalyst and have a pretty good grasp of the details. So I wanted to shore up my knowledge of this topic and find a source that explains the how and why behind this assertion.

Can anyone provide guidance about this? Specifically, why is this so, and how do we ensure that when we cache our datasets we are not actually getting in the way of the optimizer and making things worse? Thanks!

Answer

How and why can a cache decrease performance?

Let's use a simple example to demonstrate:

// Some data
val df = spark.range(100)

df.join(df, Seq("id")).filter('id < 20).explain(true)
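
(The 'id column syntax relies on spark.implicits._, which the spark-shell imports automatically. Outside the shell you need the import yourself; here is a minimal self-contained sketch, assuming a local Spark setup, that reproduces the same query. The app name and master are illustrative:)

import org.apache.spark.sql.SparkSession

// Hypothetical app name and master; adjust for your environment.
val spark = SparkSession.builder().appName("cache-demo").master("local[4]").getOrCreate()
import spark.implicits._  // enables the 'id symbol-to-Column conversion

val df = spark.range(100)
df.join(df, Seq("id")).filter('id < 20).explain(true)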

Here, the Catalyst plan optimizes this join by applying the filter on each DataFrame before joining, to reduce the amount of data that gets shuffled.

== Optimized Logical Plan ==
Project [id#0L]
+- Join Inner, (id#0L = id#69L)
   :- Filter (id#0L < 20)
   :  +- Range (0, 100, step=1, splits=Some(4))
   +- Filter (id#69L < 20)
      +- Range (0, 100, step=1, splits=Some(4))

If we cache the query after the join, the query won't be as optimized, as we can see here:

df.join(df, Seq("id")).cache.filter('id <20).explain(true)

== Optimized Logical Plan ==
Filter (id#0L < 20)
+- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *Project [id#0L]
         +- *BroadcastHashJoin [id#0L], [id#74L], Inner, BuildRight
            :- *Range (0, 100, step=1, splits=4)
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
               +- *Range (0, 100, step=1, splits=4)

The filter is applied at the very end...

Why so? Because cache materializes the DataFrame (by default to memory, spilling to disk). Every subsequent query will reuse this cached DataFrame, so Catalyst can only optimize the part of the query that comes AFTER the cache; it will not push filters through the cached plan. We can check that with the same example!

df.join(df, Seq("id")).cache.join(df, Seq("id")).filter('id <20).explain(true)

== Optimized Logical Plan ==
Project [id#0L]
+- Join Inner, (id#0L = id#92L)
   :- Filter (id#0L < 20)
   :  +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :        +- *Project [id#0L]
   :           +- *BroadcastHashJoin [id#0L], [id#74L], Inner, BuildRight
   :              :- *Range (0, 100, step=1, splits=4)
   :              +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :                 +- *Range (0, 100, step=1, splits=4)
   +- Filter (id#92L < 20)
      +- Range (0, 100, step=1, splits=Some(4))

The filter is applied before the second join, but after the first one, because the first join's result is cached.
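
If a cached intermediate turns out to block a useful optimization like this, you can drop it and let Catalyst plan across the whole query again. A minimal sketch, assuming the same spark-shell session as above (unpersist is a standard Dataset method; the variable name joined is illustrative):

// Cache the join result while it is being reused.
val joined = df.join(df, Seq("id")).cache()
joined.count()  // materializes the cache

// Once the reuse is over, drop the cached data. Queries built from df
// afterwards no longer hit the InMemoryRelation, so the filter is
// pushed below the join again.
joined.unpersist()
df.join(df, Seq("id")).filter('id < 20).explain(true)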

How do we make sure caching does not get in the way? By knowing what you do! You can simply compare the Catalyst plans with and without the cache and see which optimizations Spark is missing.
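
For example, a minimal sketch of such a comparison (queryExecution.optimizedPlan is the plan Catalyst produces after optimization; the variable names are illustrative):

val uncached = df.join(df, Seq("id")).filter('id < 20)
val cached   = df.join(df, Seq("id")).cache().filter('id < 20)

// Printing both optimized plans shows whether the filter was pushed
// below the join or left above an InMemoryRelation.
println(uncached.queryExecution.optimizedPlan)
println(cached.queryExecution.optimizedPlan)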

