Partitions not being pruned in simple SparkSQL queries


Problem Description


I'm trying to efficiently select individual partitions from a SparkSQL table (parquet in S3). However, I see evidence of Spark opening all parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions.


Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

import pandas

# Make some data
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame (hiveContext is an existing HiveContext)
sdf = hiveContext.createDataFrame(df)
# And save it
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')
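
For reference, partitionBy('pk') lays the table out as one Hive-style pk=... directory per key under the table path. A minimal way to confirm what the write produced (a sketch assuming the boto 2 library and the bucket/prefix names from the example above) is to list the keys:

import boto

# List everything saveAsTable wrote under the table path ('bucket' and
# 'dataset/' are the names from the example above)
for key in boto.connect_s3().get_bucket('bucket').list(prefix='dataset/'):
    print key.name

# Roughly the expected Hive-style layout:
#   dataset/_common_metadata
#   dataset/pk=a/part-r-00001.gz.parquet
#   dataset/pk=b/part-r-00001.gz.parquet
#   dataset/pk=c/part-r-00001.gz.parquet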


In a subsequent session, I want to select a subset of this table:

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()
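
A quick way to inspect how the planner treats the pk == 'b' predicate is DataFrame.explain(), available on the Spark 1.3+ DataFrame API (a diagnostic sketch; the exact plan output varies by version):

# Print the logical and physical plans; the filter on the partition
# column should show up somewhere in the output
filtered_dataset.explain(True)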


In the logs that are subsequently printed, I see that pruning is supposed to be happening:

15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.


But then I see parquet files being opened from all partitions:

15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152


With only three partitions, this isn't a problem---but with thousands, it causes noticeable delay. Why are all these irrelevant files being opened?

Recommended Answer


Take a look at spark.sql.parquet.filterPushdown, which defaults to false because of some bugs present in the Parquet version that Spark uses. It may be possible to use it in 1.3/1.4; check the official documentation.
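
For example, it can be switched on per session on the HiveContext from the question before re-running the read (a minimal sketch; the flag name is the documented one, but whether pushdown behaves correctly in 1.3/1.4 depends on the Parquet bugs mentioned above):

# Enable parquet filter pushdown for this session
hiveContext.setConf('spark.sql.parquet.filterPushdown', 'true')

# Re-run the filtered read with pushdown enabled
dataset = hiveContext.table('dataset')
print dataset.filter(dataset.pk == 'b').toPandas()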


I think this is fixed in Spark 1.5.
