Difference between caching mechanisms in Spark SQL

Question

I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following two code snippets?

Method 1:

cache table test_cache AS
select a, b, c
from x
inner join y
on x.a = y.a;

Method 2:

create temporary view test_cache AS
select a, b, c
from x
inner join y
on x.a = y.a;

cache table test_cache;

Since computations in Spark are lazy, will Spark cache the results as soon as the temporary view is created in Method 2? Or will it wait until an action such as collect is applied to it?

Recommended answer

In Spark SQL, caching behaves differently depending on whether you use SQL directly or the DataFrame DSL. With the DSL, caching is lazy, so after calling

my_df.cache()

the data is not cached in memory right away. Only the information about caching is added to the query plan, and the data is cached once some action is called on the DataFrame.

On the other hand, when you use SQL directly, as in your example, caching is eager by default. So in your Method 1 a job runs immediately and the data is put into memory. In your Method 2 a job runs when you execute the cache statement:

cache table test_cache;

Also in SQL, caching can be made lazy by using the LAZY keyword explicitly:

cache lazy table test_cache;

In this case a job does not run immediately, and the data is put into memory after some action is called against the table test_cache.

To conclude, both of your methods are equivalent in terms of caching: the data is cached eagerly once the block of code has run.
