Predicate pushdown on non-partitioned Parquet data
Problem description
I have a folder in S3 which contains Parquet data:
bucket_name/folder_name/YEAR/MONTH/DAY
e.g.:
s3://bucket_name/folder_name/2020/12/10
I am using Apache Spark on AWS EMR to read the Parquet files.
As the data is not partitioned, is there a way to implement a predicate pushdown filter without partitioning the data?
What performance-improvement best practices can be used?
Answer
First, note that even without directory partitioning, Parquet stores per-row-group min/max statistics, so Spark can still push filters on data columns down to the Parquet reader. What you lose without partitioning is directory pruning on the date. The code below recovers year/month/day from the file paths so the data can be rewritten with proper partitions:
import pyspark.sql.functions as f
from pyspark.shell import spark
# Read absolute path and put "/*/*/*" to read all partitions
df = spark.read.parquet("s3://bucket_name/folder_name/*/*/*")
# Get absolute file path
df = df.withColumn('path', f.split(f.input_file_name(), '/'))
# Slice path and recover year / month / day in an array
df = df.withColumn('year_month_day', f.slice(f.col('path'), -4, 3))
# Transform array values to respective columns
df = df.withColumn('year', f.col('year_month_day').getItem(0))
df = df.withColumn('month', f.col('year_month_day').getItem(1))
df = df.withColumn('day', f.col('year_month_day').getItem(2))
# Drop temporary columns
df = df.drop('path', 'year_month_day')
df.show()
# TODO : Make your transformations
# .
# .
# .
# Save partitioned by year, month and day (if you want)
# df.write.partitionBy('year', 'month', 'day').parquet('...')
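The slice indices above are easy to get wrong, so here is a plain-Python sanity check of the same logic: Spark's `f.split(..., '/')` followed by `f.slice(col, -4, 3)` takes three elements starting at the fourth from the end, which in Python list terms is `parts[-4:-1]`. The file name in the path is hypothetical.

```python
# Plain-Python check of the path-slicing logic used in the Spark code.
path = "s3://bucket_name/folder_name/2020/12/10/part-00000.parquet"
parts = path.split('/')
# parts[-4:-1] mirrors Spark's f.slice(col, -4, 3):
# start at the 4th element from the end, take 3 elements.
year, month, day = parts[-4:-1]
print(year, month, day)  # 2020 12 10
```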
Output:
+--------+--------+----+-----+---+
|column_a|column_b|year|month|day|
+--------+--------+----+-----+---+
| hello_1| hello_2|2019| 06| 10|
| world_1| world_2|2020| 12| 31|
+--------+--------+----+-----+---+
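Once the data has been rewritten with `partitionBy('year', 'month', 'day')`, Spark can prune entire directories at planning time. A hedged sketch (the output path is hypothetical, and this cannot run without the S3 data); inspecting the physical plan confirms both kinds of pushdown:

```python
import pyspark.sql.functions as f
from pyspark.shell import spark

# Hypothetical path where the partitioned copy was written
df = spark.read.parquet("s3://bucket_name/folder_name_partitioned/")

# Partition columns prune whole directories; filters on data columns are
# pushed to the Parquet reader via row-group min/max statistics
filtered = df.filter((f.col('year') == '2020') & (f.col('column_a') == 'hello_1'))

# "PartitionFilters" and "PushedFilters" in the scan node confirm this
filtered.explain()
```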