Temp table caching with spark-sql

Problem description

Is a table registered with registerTempTable (createOrReplaceTempView in Spark 2.+) cached?

Using Zeppelin, I register a DataFrame in my Scala code after a heavy computation, and then within %pyspark I want to access it and filter it further.

Will it use a memory-cached version of the table? Or will it be rebuilt each time?
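
For concreteness, the cross-language Zeppelin setup looks roughly like this. It is only a sketch: heavyDF, the input path, and the filter are made-up placeholders, and the %pyspark access works because Zeppelin's Spark interpreters share one SparkSession by default.

// %spark paragraph (Scala): the heavy computation ends by registering a view.
val heavyDF = spark.read.parquet("/data/events").groupBy("id").count()
heavyDF.createOrReplaceTempView("my_table")

// %pyspark paragraph: the same name resolves through the shared SparkSession:
//   spark.table("my_table").where("count > 10")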

Recommended answer

Registered tables are not cached in memory.

The registerTempTable (createOrReplaceTempView in Spark 2.+) method just creates or replaces a view of the given DataFrame with a given query plan.

If we need to create a permanent view, it converts the query plan to a canonicalized SQL string and stores it as view text in the metastore.
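
For contrast, a permanent view is created with SQL against a metastore table; the names below are made up, and this sketch assumes a metastore-backed catalog:

// A permanent view must reference metastore objects, not temp views.
df.write.saveAsTable("my_base_table")
spark.sql("CREATE OR REPLACE VIEW my_view AS SELECT _1, _2 FROM my_base_table")
// The view's definition is now stored in the metastore as canonicalized SQL text.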

You'll need to cache your DataFrame explicitly, e.g.:

df.createOrReplaceTempView("my_table") // df.registerTempTable("my_table") for Spark < 2.0
spark.catalog.cacheTable("my_table")   // sqlContext.cacheTable("my_table") for Spark < 2.0
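
To check the cache status without digging into persistent RDDs, the catalog exposes it directly (Spark 2.x):

spark.catalog.isCached("my_table") // true once the table is marked for caching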

Let's illustrate this with an example:

Using cacheTable:

scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> sc.getPersistentRDDs
// res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()

scala> df.createOrReplaceTempView("my_table")

scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()

scala> spark.catalog.cacheTable("my_table") // sqlContext.cacheTable("...") before Spark 2.0

scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(2 -> In-memory table my_table MapPartitionsRDD[2] at cacheTable at <console>:26)
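
Uncaching reverses this. Continuing the same session (the output comment below is what one would expect, not a captured transcript; use sqlContext.uncacheTable before Spark 2.0):

scala> spark.catalog.uncacheTable("my_table")

scala> sc.getPersistentRDDs
// Map() (the in-memory table is gone)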

Now the same example, using cache followed by createOrReplaceTempView (registerTempTable before Spark 2.0):

scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()

scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> df.createOrReplaceTempView("my_table")

scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()

scala> df.cache.createOrReplaceTempView("my_table")

scala> sc.getPersistentRDDs
// res6: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = 
// Map(2 -> ConvertToUnsafe
// +- LocalTableScan [_1#0,_2#1], [[1,2],[b,3]]
//  MapPartitionsRDD[2] at cache at <console>:28)
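
Either way, once the plan is cached, any later query against the registered name, including one from a %pyspark paragraph in the same SparkSession, is served from memory after a first action materializes it; df.unpersist() releases it. A sketch continuing the session above, with expected output as comments:

scala> spark.table("my_table").filter($"_2" > 2).count()
// the first action materializes the cache; later reads come from memory

scala> df.unpersist()

scala> sc.getPersistentRDDs
// Map()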
