Predicate pushdown on non-partitioned Parquet data

Problem description

I have a folder in S3 which contains Parquet data:

bucket_name/folder_name/YEAR/MONTH/DAY

e.g. s3://bucket_name/folder_name/2020/12/10

I am using Apache Spark on AWS EMR to read the Parquet files.

Since the data is not partitioned, is there a way to implement a predicate pushdown filter without partitioning the data?

What performance-improvement best practices can be used?

Recommended answer

I'll describe my solution in the code:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# "from pyspark.shell import spark" only works inside the pyspark shell;
# building the session explicitly also works under spark-submit
spark = SparkSession.builder.getOrCreate()

# Read from the base path, appending "/*/*/*" to match every
# YEAR/MONTH/DAY directory
df = spark.read.parquet("s3://bucket_name/folder_name/*/*/*")

# Split each row's absolute file path on "/"
df = df.withColumn('path', f.split(f.input_file_name(), '/'))

# Slice out the last four path segments minus the file name,
# leaving [YEAR, MONTH, DAY]
df = df.withColumn('year_month_day', f.slice(f.col('path'), -4, 3))

# Promote the array elements to their own columns
df = df.withColumn('year', f.col('year_month_day').getItem(0))
df = df.withColumn('month', f.col('year_month_day').getItem(1))
df = df.withColumn('day', f.col('year_month_day').getItem(2))

# Drop the temporary columns
df = df.drop('path', 'year_month_day')

df.show()

# TODO: apply your transformations here
# ...

# Optionally, persist the result partitioned by year, month and day
# df.write.partitionBy('year', 'month', 'day').parquet('...')

My directory layout (shown as an image in the original answer).

Output:

+--------+--------+----+-----+---+
|column_a|column_b|year|month|day|
+--------+--------+----+-----+---+
| hello_1| hello_2|2019|   06| 10|
| world_1| world_2|2020|   12| 31|
+--------+--------+----+-----+---+
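
A note on the pushdown itself, since that is what the question asks: the year/month/day columns above are derived from input_file_name() after Spark has already listed the files, so a filter on them does not by itself skip S3 objects. The sketch below (reusing the bucket path and the column_a field from the example above; treat both as assumptions about your layout and schema) shows two things that do reduce I/O: constraining the read path with a glob, and filtering on a real Parquet data column, which Spark pushes down to the Parquet reader so it can skip row groups whose min/max statistics rule the predicate out.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Path pruning: only the December 2020 objects are listed and read
df_dec = spark.read.parquet("s3://bucket_name/folder_name/2020/12/*")

# Parquet filter pushdown (on by default): the predicate on a real data
# column reaches the reader, which skips row groups whose min/max
# statistics cannot satisfy it
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
df_hello = (spark.read.parquet("s3://bucket_name/folder_name/*/*/*")
            .filter(f.col("column_a") == "hello_1"))

Row-group pruning is most effective when the filtered column is roughly sorted within each file; if you need reliable directory-level pruning on year/month/day, writing the data back with partitionBy, as in the commented line above, is the standard approach.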
