Spark SQL: Cache memory footprint improves with 'order by'


Question

I have two scenarios, each with 23 GB of partitioned Parquet data. I read a few of the columns and cache the result up front, in order to fire a series of subsequent queries later on.

Setup:

  • Cluster: 12-node EMR
  • Spark version: 1.6
  • Spark configuration: default
  • Run configuration: same for both cases

Case 1:

val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
dfMain.cache.count

From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.

Case 2:

val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")
dfMain.cache.count

From the Spark UI, the input data read is again 6.2 GB, but the cached object is only 5.5 GB.

Is there any explanation, or code reference, for this behavior?

Answer

It is actually relatively simple. As you can read in the SQL guide:

Spark SQL can cache tables using an in-memory columnar format ... Spark SQL will scan only required columns and will automatically tune compression

The nice thing about sorted columnar storage is that typical data becomes very easy to compress. When you sort, you get blocks of similar records that can be squashed together using even very simple techniques such as RLE.

This property is actually used quite often in databases with columnar storage, because it is very efficient not only in terms of storage but also for aggregations.

Different aspects of Spark's columnar compression are covered by the sql.execution.columnar.compression package, and as you can see, RunLengthEncoding is indeed one of the available compression schemes.
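
To make the RLE intuition concrete, here is a minimal, self-contained Scala sketch of run-length encoding (illustrative only, not Spark's internal implementation), applied to the same values before and after sorting:

// Minimal run-length encoding sketch: each run of equal consecutive values
// is stored once, together with its length.
def rle[T](xs: Seq[T]): Seq[(T, Int)] =
  xs.foldLeft(List.empty[(T, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest  // extend the current run
    case (acc, x)                      => (x, 1) :: acc       // start a new run
  }.reverse

val unsorted = Seq("b", "a", "b", "c", "a", "b", "c", "a")
val sorted   = unsorted.sorted

rle(unsorted) // 8 runs of length 1 -> no savings
rle(sorted)   // 3 runs: (a,3), (b,3), (c,2) -> far fewer entries to store

The same idea, applied column by column and combined with the other codecs in that package, is why the sorted cache in Case 2 is so much smaller.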

So there are two parts here:

  • Spark SQL automatically selects a compression codec for each column based on statistics of the data.

  • Sorting clusters similar records together, which makes compression much more efficient.

    If there are some correlations between the columns (and when is that not the case?), even a simple sort on a single column can have a relatively large impact and improve the performance of the different compression schemes.
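
As a practical follow-up (a hedged sketch of my own, not part of the original answer): if the full shuffle triggered by order by is too expensive, sortWithinPartitions, available since Spark 1.6, clusters rows by pogId within each partition only. Whether that recovers the same compression benefit depends on how pogId values are spread across the partitions, so the cached size needs to be checked in the Storage tab of the Spark UI:

// Hedged variation on Case 2: sort only within partitions instead of globally.
// Paths and column names are the hypothetical ones from the question.
val paths = Array("s3://my/parquet/path")
val parqFile = sqlContext.read.parquet(paths: _*)
parqFile.registerTempTable("productViewBase")

// In-memory columnar compression is on by default; set explicitly here only for visibility.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

val dfMain = sqlContext
  .sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
  .sortWithinPartitions("pogId") // clusters rows locally, without the full shuffle of "order by pogId"
dfMain.cache.count

Comparing the resulting cached size with the 15.1 GB and 5.5 GB figures above shows how much of the gain comes purely from local clustering.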
