Partitions not being pruned in simple SparkSQL queries


Problem Description


I'm trying to efficiently select individual partitions from a SparkSQL table (parquet in S3). However, I see evidence of Spark opening all parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions.


Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

import pandas

# Make some data
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame (hiveContext is an existing HiveContext)
sdf = hiveContext.createDataFrame(df)
# And save it
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')
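
For reference, partitionBy('pk') lays the table out as one Hive-style pk=... directory per key under the table path. A minimal way to confirm what the write produced (a sketch assuming the boto 2 library and the bucket/prefix names from the example above) is to list the keys:

import boto

# List everything saveAsTable wrote under the table path ('bucket' and
# 'dataset/' are the names from the example above)
for key in boto.connect_s3().get_bucket('bucket').list(prefix='dataset/'):
    print key.name

# Roughly the expected Hive-style layout:
#   dataset/_common_metadata
#   dataset/pk=a/part-r-00001.gz.parquet
#   dataset/pk=b/part-r-00001.gz.parquet
#   dataset/pk=c/part-r-00001.gz.parquet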


In a subsequent session, I want to select a subset of this table:

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()
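
A quick way to inspect how the planner treats the pk == 'b' predicate is DataFrame.explain(), available on the Spark 1.3+ DataFrame API (a diagnostic sketch; the exact plan output varies by version):

# Print the logical and physical plans; the filter on the partition
# column should show up somewhere in the output
filtered_dataset.explain(True)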


In the logs that are subsequently printed, I see that pruning is supposed to be happening:

15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.


But then I see parquet files being opened from all partitions:

15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152


With only three partitions, this isn't a problem---but with thousands, it causes noticeable delay. Why are all these irrelevant files being opened?

Recommended Answer


Take a look at spark.sql.parquet.filterPushdown, which defaults to false because of some bugs present in the Parquet version that Spark uses. It may be possible to use it in 1.3/1.4; check the official documentation.
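
For example, it can be switched on per session on the HiveContext from the question before re-running the read (a minimal sketch; the flag name is the documented one, but whether pushdown behaves correctly in 1.3/1.4 depends on the Parquet bugs mentioned above):

# Enable parquet filter pushdown for this session
hiveContext.setConf('spark.sql.parquet.filterPushdown', 'true')

# Re-run the filtered read with pushdown enabled
dataset = hiveContext.table('dataset')
print dataset.filter(dataset.pk == 'b').toPandas()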


I think this is fixed in Spark 1.5.
