Partitions not being pruned in simple SparkSQL queries

Question

I'm trying to efficiently select individual partitions from a SparkSQL table (parquet in S3). However, I see evidence of Spark opening all parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions.

Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

import pandas

# Make some data
df = pandas.DataFrame({'pk': ['a']*5 + ['b']*5 + ['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame (hiveContext is the HiveContext from the PySpark shell)
sdf = hiveContext.createDataFrame(df)
# And save it as a Parquet table partitioned by 'pk', registered in the Hive metastore
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')

In a subsequent session, I want to select a subset of this table:

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()

In the logs that follow, I see that pruning is supposed to be happening:

15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.

But then I see parquet files being opened from all partitions:

15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152

With only three partitions this isn't a problem, but with thousands of them it causes a noticeable delay. Why are all these irrelevant files being opened?

Answer

Take a look at spark.sql.parquet.filterPushdown. It is set to false by default because of some bugs in the Parquet version that Spark uses. It may be possible to enable it in 1.3/1.4; check the official documentation.
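
For example, enabling it for the current session might look like this (a minimal sketch, reusing the hiveContext and table name from the question; setConf is the standard SQLContext/HiveContext way to set a SQL option):

# Enable Parquet filter pushdown for this session before running the query
hiveContext.setConf('spark.sql.parquet.filterPushdown', 'true')

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()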

I think this is fixed in Spark 1.5.
