Partitioning by a timestamp column in PySpark DataFrames

Problem description

I have a DataFrame in PySpark in the following format:

Date        Id  Name    Hours   Dno Dname
12/11/2013  1   sam     8       102 It
12/10/2013  2   Ram     7       102 It
11/10/2013  3   Jack    8       103 Accounts
12/11/2013  4   Jim     9       101 Marketing

I want to partition it by dno and save it as a Hive table in Parquet format.

df.write.saveAsTable(
    'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')

The query worked fine and created the table in Hive, stored as Parquet.
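
As a quick sanity check (an addition, not part of the original question), the partitions registered for the table can be listed with Spark SQL; the SparkSession variable name spark is an assumption here:

# Hypothetical verification step: list the Dno partitions registered for the
# table written above (assumes a SparkSession named `spark`).
spark.sql("SHOW PARTITIONS default.testing").show(truncate=False)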

Now I want to partition based on the year and month of the date column. The timestamp is a Unix timestamp.

How can we achieve that in PySpark? I have done it in Hive but am unable to do it in PySpark.

Recommended answer

Just extract the fields you want to use and provide a list of columns as an argument to the writer's partitionBy. If timestamp is a Unix timestamp expressed in seconds:

df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])

Add the columns:

from pyspark.sql.functions import year, month, col

df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))

And write:

(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))
