Difference between caching mechanisms in Spark SQL


Question

I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following two code snippets?

Method 1:

cache table test_cache AS
select a, b, c
from x
inner join y
on x.a = y.a;

Method 2:

create temporary view test_cache AS
select a, b, c
from x
inner join y
on x.a = y.a;

cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results as soon as the temporary view is created in Method 2? Or will it wait until an action such as collect is applied to it?

Recommended answer

In Spark SQL, caching behaves differently depending on whether you use SQL directly or the DataFrame DSL. With the DSL, caching is lazy, so after calling

my_df.cache()

the data is not cached in memory right away. Only the caching information is added to the query plan, and the data is cached once some action is called on the DataFrame.

On the other hand, when you use SQL directly, as you do in your example, caching is eager by default. So in your Method 1, a job runs immediately and the data is put into memory. In your Method 2, creating the temporary view does not run a job by itself; the job runs when you execute the cache statement:

cache table test_cache;

In SQL, caching can also be made lazy by using the lazy keyword explicitly:

cache lazy table test_cache;

In this case a job does not run immediately, and the data is put into memory only after some action is called against the table test_cache.
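
The eager and lazy behaviours described above can be illustrated with a small toy model in plain Python. This is only a sketch of the semantics, not Spark code: the ToyTable class, its cache and collect methods, the jobs_run counter, and the sample rows are all hypothetical names invented for this illustration.

```python
class ToyTable:
    """Toy stand-in for a table or view; not part of any Spark API."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the rows
        self._cache_requested = False  # set once cache() has been called
        self._cached_rows = None       # filled when the data is materialized
        self.jobs_run = 0              # counts how often the rows were computed

    def _run_job(self):
        self.jobs_run += 1
        return self._compute()

    def cache(self, lazy=False):
        """Mark the table as cached; materialize right away unless lazy=True."""
        self._cache_requested = True
        if not lazy:
            self._cached_rows = self._run_job()  # eager: the job runs now

    def collect(self):
        """An 'action': returns the rows, filling the cache on first use."""
        if self._cached_rows is not None:
            return self._cached_rows
        rows = self._run_job()
        if self._cache_requested:
            self._cached_rows = rows
        return rows


def make_rows():
    # Hypothetical sample data standing in for the join result.
    return [(1, "b1", "c1"), (2, "b2", "c2")]


eager = ToyTable(make_rows)
eager.cache()                        # analogous to: cache table test_cache
eager_jobs_after_cache = eager.jobs_run

lazy = ToyTable(make_rows)
lazy.cache(lazy=True)                # analogous to: cache lazy table test_cache
lazy_jobs_after_cache = lazy.jobs_run
lazy.collect()                       # first action materializes the cache
lazy.collect()                       # second action is served from the cache
lazy_jobs_after_two_actions = lazy.jobs_run
```

In this model the eager table has already run its single job when cache() returns, while the lazy table runs its single job on the first collect() and reuses the result afterwards.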

To conclude, both of your methods are equivalent in terms of caching: in each case the data is cached eagerly once the block of code has run.
