Reading partition columns without partition column names
Problem description
We have data stored in S3, partitioned in the following structure:
bucket/directory/table/aaaa/bb/cc/dd/
where aaaa is the year, bb is the month, cc is the day and dd is the hour.
As you can see, there are no partition keys in the path (year=aaaa, month=bb, day=cc, hour=dd).
As a result, when I read the table into Spark, there are no year, month, day or hour columns.
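Is there any way I can read the table into Spark and include the partition columns without:
- changing the path names in S3
- iterating over each partition value in a loop and reading the partitions into Spark one by one (this is a huge table, so that takes far too long and is clearly not optimal)?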
Recommended answer
Spark can't discover partitions that aren't encoded as partition_name=value in the path, so you'll have to create the columns yourself.
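To see why discovery comes up empty for this layout, here is a rough pure-Python illustration of the idea. It is not Spark's actual implementation: it only shows that directory names of the form name=value are what can contribute a partition column. The helper name hive_style_partitions and the sample paths are hypothetical.

```python
def hive_style_partitions(path):
    """Collect name=value directory segments from a path (illustration only)."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        name, sep, value = segment.partition("=")
        # Only segments that look like name=value yield a partition column
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical path with partition keys: four columns can be recovered
keyed = hive_style_partitions("bucket/directory/table/year=2021/month=01/day=10/hour=14/")

# Path layout from the question: no segment contains "=", so nothing is found
plain = hive_style_partitions("bucket/directory/table/2021/01/10/14/")
```

With the layout from the question, no column names can be recovered from the path, which is why they have to be derived from the file name instead.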
After you load the paths bucket/directory/table/aaaa/bb/cc/dd/ into a DataFrame, you can extract those partitions from the source file name, which you get with input_file_name().
First, split the file path using the / delimiter, then create columns from the four elements that precede the last one:
from pyspark.sql import functions as F

# Split the source file path on "/" and keep the 4 elements that precede the last one
df1 = df.withColumn("date_partitions", F.slice(F.split(F.input_file_name(), "/"), -5, 4)) \
    .withColumn("year", F.col("date_partitions").getItem(0)) \
    .withColumn("month", F.col("date_partitions").getItem(1)) \
    .withColumn("day", F.col("date_partitions").getItem(2)) \
    .withColumn("hour", F.col("date_partitions").getItem(3)) \
    .drop("date_partitions")
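The start index of -5 with a length of 4 can look arbitrary, so here is a minimal plain-Python sketch of the same index arithmetic, with no Spark involved. The helper name date_parts and the file name part-00000.parquet are made up for illustration; the assumption is that the last element after splitting is either a file name or an empty string left by a trailing slash.

```python
def date_parts(file_path):
    """Return the four path elements just before the last one."""
    elements = file_path.split("/")
    # Same window as slice(..., -5, 4): start at the 5th element from the end, take 4
    return elements[-5:-1]


# A full file path ends with a file name (hypothetical name here)
from_file = date_parts("s3://bucket/directory/table/2021/01/10/14/part-00000.parquet")

# A directory path with a trailing slash ends with an empty string after splitting
from_dir = date_parts("bucket/directory/table/2021/01/10/14/")
```

In both cases the window covers the year, month, day and hour directories. The values are strings such as "01", so cast them if you need numeric columns.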
Example, with an input_file_name column standing in for the value returned by input_file_name():
data = [
(1, 2, "bucket/directory/table/2021/01/10/14/"),
(3, 4, "bucket/directory/table/2021/01/11/18/")
]
df = spark.createDataFrame(data, ["a", "b", "input_file_name"])
Gives:
#+---+---+-------------------------------------+----+-----+---+----+
#|a |b |input_file_name |year|month|day|hour|
#+---+---+-------------------------------------+----+-----+---+----+
#|1 |2 |bucket/directory/table/2021/01/10/14/|2021|01 |10 |14 |
#|3 |4 |bucket/directory/table/2021/01/11/18/|2021|01 |11 |18 |
#+---+---+-------------------------------------+----+-----+---+----+