Does Spark support Partition Pruning with Parquet Files


Problem Description

I am working with a large dataset that is partitioned by two columns, plant_name and tag_id. The second partition column, tag_id, has 200,000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:

sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")

I would expect a fast response, since this resolves to a single partition. In Hive and Presto this takes seconds, but in Spark it runs for hours.

The actual data is held in an S3 bucket, and when I submit the SQL query, Spark goes off and first fetches all of the partitions from the Hive metastore (200,000 of them), and then calls refresh() to force a full status listing of all of these files in the S3 object store (actually calling listLeafFilesInParallel).

It is these two operations that are so expensive. Are there any settings that can get Spark to prune the partitions earlier, either during the call to the metastore or immediately afterwards?

Recommended Answer

Yes, Spark supports partition pruning.

Spark lists the partition directories (sequentially, or in parallel via listLeafFilesInParallel) to build a cache of all partitions the first time around. Queries in the same application that scan the data take advantage of this cache, so the slowness you are seeing is most likely this cache being built. Subsequent queries that scan the data use the cache to prune partitions.
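
As a minimal sketch of that cost model (assuming the same sqlContext and tag_data table from the question; the second tag_id value is made up for illustration), the first scan of the table in an application pays the listing cost, while later scans in the same application reuse the cached partition metadata:

// First query against the table in this application: Spark lists all the
// partition directories (listLeafFilesInParallel) and caches the result,
// so this one is slow.
val first = sqlContext.sql(
  "select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
first.count()

// A later query against the same table in the same application reuses the
// cached partition metadata, so only the matching partition is scanned.
val second = sqlContext.sql(
  "select * from tag_data where plant_name='PLANT01' and tag_id='2000'")
second.count()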

These are the logs that show the partitions being listed to populate the cache:

App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver

These are the logs showing that the pruning is happening:

App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
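
If you want to confirm the pruning in your own run without grepping the logs, one option (a sketch reusing the df from the question; the exact plan output varies between Spark versions) is to print the query plan and check which input paths and pushed filters the Parquet scan ends up with:

// Print the parsed/analyzed/optimized/physical plans; the physical plan for
// the Parquet scan shows the paths and pushed filters that remain after
// partition pruning and filter pushdown.
df.explain(true)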

Refer to convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
