Partitions not being pruned in simple SparkSQL queries

Question

I'm trying to efficiently select individual partitions from a SparkSQL table (parquet in S3). However, I see evidence of Spark opening all parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions.

Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

import pandas

# Make some data
df = pandas.DataFrame({'pk': ['a']*5 + ['b']*5 + ['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame (hiveContext is the HiveContext from the PySpark shell)
sdf = hiveContext.createDataFrame(df)
# And save it as a Parquet table partitioned by 'pk', registered in the Hive metastore
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')

In a subsequent session, I want to select a subset of this table:

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()

In the logs that follow, I see that pruning is supposed to be happening:

15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.

But then I see parquet files being opened from all partitions:

15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152

With only three partitions this isn't a problem, but with thousands of them it causes a noticeable delay. Why are all these irrelevant files being opened?

Answer

Take a look at spark.sql.parquet.filterPushdown. It is set to false by default because of some bugs in the Parquet version that Spark uses. It may be possible to enable it in 1.3/1.4; check the official documentation.
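
For example, enabling it for the current session might look like this (a minimal sketch, reusing the hiveContext and table name from the question; setConf is the standard SQLContext/HiveContext way to set a SQL option):

# Enable Parquet filter pushdown for this session before running the query
hiveContext.setConf('spark.sql.parquet.filterPushdown', 'true')

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()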

I think this is fixed in Spark 1.5.
