Lazy Evaluation in SparkSQL

Question

In this code from the Spark Programming Guide,

# The result of loading a parquet file is also a DataFrame.
parquetFile = sqlContext.read.parquet("people.parquet")

# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.collect()

What exactly happens in the Java heap (how is the Spark memory managed) when each line is executed?

Specifically, I have these questions:

  1. Is sqlContext.read.parquet lazy? Does it cause the whole parquet file to be loaded in memory?
  2. When the collect action is executed, for the SQL query to be applied,

a. is the entire parquet first stored as an RDD and then processed or

b. is the parquet file processed first to select only the name column, then stored as an RDD and then filtered based on the age condition by Spark?

Answer

Is sqlContext.read.parquet lazy?

Yes. By default, all transformations in Spark are lazy.
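
A quick way to see this for yourself (a minimal sketch, assuming the same sqlContext and people.parquet file as in the question; df and filtered are illustrative names):

# Returns immediately; only Parquet metadata (the schema in the footer) is read.
df = sqlContext.read.parquet("people.parquet")

# Still no Spark job: transformations only build up a query plan.
filtered = df.filter(df.age >= 13).select("name")

# count() is an action, so this line is the first time a job actually runs
# and row data is read.
filtered.count()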

When the collect action is executed, for the SQL query to be applied

a. is the entire parquet first stored as an RDD and then processed or

b. is the parquet file processed first to select only the name column, then stored as an RDD and then filtered based on the age condition by Spark?

Spark generates a new RDD for each action. More to the point, Parquet is a columnar format, so only the columns the query needs (here, name and age) are read at all, and Parquet readers use push-down filters to further reduce disk I/O. Push-down filters allow early data-selection decisions to be made before the data is even read into Spark. So only part of the file is loaded into memory; the behavior is scenario (b), not (a).
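
You can verify this by printing the query plan before calling collect. The following is a minimal sketch, assuming the same sqlContext and people.parquet file from the question; the exact plan text varies across Spark versions, but it should show a scan of only the name and age columns, with the age predicates pushed down to the Parquet reader.

# Build the same query as in the question; nothing is computed yet.
parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")

# Print the logical and physical plans. Look for the pruned column list
# and the pushed-down age filters on the Parquet scan.
teenagers.explain(True)

# Only now are the pruned, filtered rows materialized on the driver.
teenagers.collect()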
