Does Spark support Partition Pruning with Parquet Files?


Problem Description



I am working with a large dataset that is partitioned by two columns - plant_name and tag_id. The second partition column, tag_id, has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:

sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")

I would expect a fast response, as this resolves to a single partition. In Hive and Presto this takes seconds; in Spark, however, it runs for hours.
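As an aside, the same two options can also be supplied once when the context is constructed (or with --conf on spark-submit) rather than through per-session setConf calls. A minimal sketch, assuming Spark 1.6-era APIs and a Hive-backed context:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Set the spark.sql.* options up front; they are picked up as SQL options
// when the context is created.
val conf = new SparkConf()
  .setAppName("tag-data-query")   // hypothetical application name
  .set("spark.sql.hive.metastorePartitionPruning", "true")
  .set("spark.sql.parquet.filterPushdown", "true")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)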

The actual data is held in an S3 bucket, and when I submit the SQL query, Spark goes off and first gets all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status listing of all these files in the S3 object store (actually calling listLeafFilesInParallel).

It is these two operations that are so expensive. Are there any settings that can get Spark to prune the partitions earlier - either during the call to the metadata store, or immediately afterwards?

Solution

Yes, Spark supports partition pruning.

Spark does a listing of partition directories (sequentially, or in parallel via listLeafFilesInParallel) to build a cache of all the partitions the first time around. Queries in the same application that scan the data take advantage of this cache, so the slowness you see is likely this cache being built. Subsequent queries that scan the data make use of the cache to prune partitions.
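A rough way to observe this within one application (an illustrative sketch, not code from the answer; it assumes the sqlContext and tag_data table from the question) is to run the same scan twice and compare the elapsed time: the first run pays the listing cost, the second reuses the cached listing.

// Crude timing helper, for illustration only.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

// The first scan triggers the partition listing and builds the cache.
timed("first scan") {
  sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'").count()
}

// A second scan in the same application reuses the cached listing.
timed("second scan") {
  sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'").count()
}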

These are the logs which show partitions being listed to populate the cache.

App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver

These are the logs showing pruning is happening.

App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.

Refer to convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
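If the INFO-level logs are not easy to get at, another way to check from the application side is to print the query plans and see which filters reached the scan. A minimal sketch (the exact plan text varies between Spark versions):

// Build the query and print the parsed, analyzed, optimized and physical plans.
val pruned = sqlContext.sql(
  "select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
pruned.explain(true)  // extended = true also prints the physical plan with the scan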
