Partition of a Timestamp Column in PySpark DataFrames
Question
I have a DataFrame in PySpark in the following format:
Date        Id  Name  Hours  Dno  Dname
12/11/2013  1   sam   8      102  It
12/10/2013  2   Ram   7      102  It
11/10/2013  3   Jack  8      103  Accounts
12/11/2013  4   Jim   9      101  Marketing
I want to partition it by dno and save it as a Hive table in Parquet format.
df.write.saveAsTable(
    'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')
The query worked fine and created the table in Hive in Parquet format.
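As a quick sanity check (a sketch, assuming a SparkSession named spark is available, Spark 2.x+), the resulting partitions can be listed from Hive:

# Each distinct Dno value becomes a partition directory, e.g. Dno=102.
# Assumption: a SparkSession named `spark` exists in the session.
spark.sql("SHOW PARTITIONS default.testing").show(truncate=False)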
Now I want to partition based on the year and month of the date column. The timestamp is a Unix timestamp. How can we achieve that in PySpark? I have done it in Hive but am unable to do it in PySpark.
Answer
Just extract the fields you want to use and provide them as a list of columns to the writer's partitionBy. If timestamp is a Unix timestamp expressed in seconds:
# Sample data: the first column is a Unix timestamp in seconds.
df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])
Add the columns:
from pyspark.sql.functions import year, month, col

# Cast the Unix seconds to a timestamp type, then extract year and month.
df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))
And write:
(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))
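If the column to partition on is instead a date string like 12/11/2013, as in the question's sample data, the same approach works after parsing it with to_date. A minimal sketch, assuming Spark 2.2+ (where to_date accepts a format pattern) and a dd/MM/yyyy layout; adjust the pattern if the dates are month-first:

from pyspark.sql.functions import to_date, year, month, col

# Hypothetical variant: parse the string Date column first, then derive
# year/month partition columns from the parsed date.
df_str = sc.parallelize([
    ("12/11/2013", 1, "sam", 8, 102, "It")
]).toDF(["date", "id", "name", "hours", "dno", "dname"])

df_parts = (df_str
    .withColumn("d", to_date(col("date"), "dd/MM/yyyy"))
    .withColumn("year", year(col("d")))
    .withColumn("month", month(col("d"))))

From there, df_parts can be written with the same partitionBy("year", "month") call shown above.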