Reading partition columns without partition column names

Views: 79 Published: 2021/4/3 19:26:30 Tags: apache-spark amazon-s3 pyspark parquet partition


Question

We have data stored in s3, partitioned in the following structure:

bucket/directory/table/aaaa/bb/cc/dd/

where aaaa is the year, bb is the month, cc is the day and dd is the hour.

As you can see, there are no partition keys in the path (that is, no year=aaaa, month=bb, day=cc, hour=dd).

As a result, when I read the table into Spark, there are no year, month, day or hour columns.

Is there any way I can read the table into Spark and include the partition columns without:

changing the path names in s3
looping over each partition value and reading the partitions into Spark one by one (this is a huge table, so that takes far too long and is clearly not optimal).

Recommended answer

Spark can't discover partitions that aren't encoded as partition_name=value in the path, so you'll have to create the columns yourself.
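To illustrate why the plain layout yields no partition columns, here is a minimal plain-Python sketch of the key=value convention. It is not Spark's actual discovery code; the function name hive_partitions and the year=/month=/day=/hour= example path are assumptions made up for this illustration.

```python
def hive_partitions(path):
    """Collect key=value directory segments, the layout partition discovery relies on."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partition directories.
        if sep and key and value:
            partitions[key] = value
    return partitions


# Hypothetical path that follows the key=value convention
hive_style = "bucket/directory/table/year=2021/month=01/day=10/hour=14/"
# Layout from the question: values only, no keys
plain_style = "bucket/directory/table/2021/01/10/14/"

print(hive_partitions(hive_style))   # {'year': '2021', 'month': '01', 'day': '10', 'hour': '14'}
print(hive_partitions(plain_style))  # {}
```

With the layout from the question, nothing in the path names the columns, which is why they have to be derived from the file path.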

After you load the paths bucket/directory/table/aaaa/bb/cc/dd/ into a DataFrame, you can extract those partitions from the source file name, which you get with input_file_name().

First, split the file path on the / delimiter, then create columns from the four path segments that precede the file name:

from pyspark.sql import functions as F

# Split the file path on "/" and keep the four directory segments
# (year, month, day, hour) that come right before the file name.
df1 = df.withColumn("date_partitions", F.slice(F.split(F.input_file_name(), "/"), -5, 4)) \
    .withColumn("year", F.col("date_partitions").getItem(0)) \
    .withColumn("month", F.col("date_partitions").getItem(1)) \
    .withColumn("day", F.col("date_partitions").getItem(2)) \
    .withColumn("hour", F.col("date_partitions").getItem(3)) \
    .drop("date_partitions")
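The slice arguments -5 and 4 can look arbitrary, so here is a rough plain-Python equivalent of the split and slice step, with no Spark involved. The helper name date_partitions, the s3:// prefix and the file name part-00000.parquet are assumptions for illustration only; they do not come from the question.

```python
def date_partitions(file_path):
    """Return the year, month, day and hour segments that precede the file name."""
    parts = file_path.split("/")
    # Same idea as slice(parts, -5, 4): start five elements from the end, take four.
    # The fifth-from-last element is the year; the last element is the file name.
    year, month, day, hour = parts[-5:-1]
    return {"year": year, "month": month, "day": day, "hour": hour}


# Hypothetical full path, similar to what input_file_name() might return
example = "s3://bucket/directory/table/2021/01/10/14/part-00000.parquet"
print(date_partitions(example))
# {'year': '2021', 'month': '01', 'day': '10', 'hour': '14'}
```

A path that ends in a trailing slash behaves the same way, because the split leaves an empty string as the last element in place of the file name. Note that the extracted values are strings, so "01" stays "01" unless you cast it.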

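Example:

data = [
    (1, 2, "bucket/directory/table/2021/01/10/14/"),
    (3, 4, "bucket/directory/table/2021/01/11/18/")
]

# A plain string column stands in for the value input_file_name() would return
df = spark.createDataFrame(data, ["a", "b", "input_file_name"])

# Apply the same split/slice logic to that column
parts = F.slice(F.split(F.col("input_file_name"), "/"), -5, 4)
df1 = df.select("*", *[parts.getItem(i).alias(c) for i, c in enumerate(["year", "month", "day", "hour"])])
df1.show(truncate=False)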

Gives:

#+---+---+-------------------------------------+----+-----+---+----+
#|a  |b  |input_file_name                      |year|month|day|hour|
#+---+---+-------------------------------------+----+-----+---+----+
#|1  |2  |bucket/directory/table/2021/01/10/14/|2021|01   |10 |14  |
#|3  |4  |bucket/directory/table/2021/01/11/18/|2021|01   |11 |18  |
#+---+---+-------------------------------------+----+-----+---+----+
