Spark SQL: Cache Memory footprint improves with 'order by'

Problem description

I have two scenarios where I have 23 GB of partitioned Parquet data; I read a few of the columns and cache them upfront in order to fire a series of subsequent queries later on.

Setup:

  • Cluster: 12-node EMR
  • Spark version: 1.6
  • Spark configuration: default
  • Run configuration: same for both cases

Case 1:

val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
dfMain.cache.count

From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.

Case 2:

val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")
dfMain.cache.count

From the Spark UI, the input data read is 6.2 GB and the cached object is 5.5 GB.
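
The sizes above are read from the Spark UI's Storage tab. If you prefer to check them programmatically, a rough sketch on the 1.6 API is shown below; getRDDStorageInfo is a developer API, and it is assumed here that the cached DataFrame is the only persisted RDD.

// Rough programmatic equivalent of the Storage tab (Spark 1.6, developer API).
sc.getRDDStorageInfo.foreach { info =>
  println(f"${info.name}: ${info.memSize / math.pow(1024, 3)}%.1f GB in memory")
}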

Is there any explanation, or code reference, for this behavior?

Recommended answer

It is actually relatively simple. As you can read in the SQL guide:

Spark SQL can cache tables using an in-memory columnar format ... Spark SQL will scan only required columns and will automatically tune compression

The nice thing about sorted columnar storage is that it is very easy to compress on typical data. When you sort, you get blocks of similar records which can be squashed together using even very simple techniques like RLE (run-length encoding).
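
The effect is easy to see with a toy run-length encoder. This is a plain Scala sketch for illustration only, not Spark's internal codec:

// Toy RLE: collapse consecutive equal values into (value, runLength) pairs.
def rle[T](xs: Seq[T]): Seq[(T, Int)] =
  xs.foldLeft(List.empty[(T, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

val unsorted = Seq("b", "a", "c", "a", "b", "c", "a", "b", "c")
val sorted   = unsorted.sorted

rle(unsorted).size  // 9 runs, one per value: no compression at all
rle(sorted).size    // 3 runs: ("a",3), ("b",3), ("c",3)

The same column collapses from nine runs to three once it is sorted, which is the effect the cached sizes above show at a much larger scale.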

This property is actually used quite often in databases with columnar storage, because it is not only very efficient in terms of storage but also for aggregations.

Different aspects of Spark's columnar compression are covered by the sql.execution.columnar.compression package and, as you can see, RunLengthEncoding is indeed one of the available compression schemes.

So there are two pieces here:

  • Spark SQL will automatically select a compression codec for each column based on statistics of the data.

  • Sorting can cluster similar records together, making compression much more efficient (see the sketch below).

    If there are some correlations between columns (and when is that not the case?), even a simple sort based on a single column can have a relatively large impact and improve the performance of different compression schemes.
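
To tie this back to the question, here is a minimal sketch on the 1.6 API. The configuration key below is the standard in-memory columnar storage knob (it defaults to true); everything else reuses the names from the question:

// Compressed in-memory columnar storage is on by default; set explicitly only for clarity.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

// Pre-sorting on a clustering column (pogId) gives codecs such as RunLengthEncoding
// long runs of equal values to collapse, which is why Case 2 caches so much smaller.
val dfSorted = sqlContext.sql(
  "select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")
dfSorted.cache.count  // materialize the cache, then compare sizes on the Storage tab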
