Drop partition columns when writing parquet in pyspark
Question
I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
import pyspark.sql.functions as f

df = (df
      .withColumn('year', f.year(f.col('date_col')))
      .withColumn('month', f.month(f.col('date_col')))
      .withColumn('day', f.dayofmonth(f.col('date_col'))))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.
Answer
Spark/Hive won't write the year, month, day columns into your parquet files, because they are already in the partitionBy clause.
Example:
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv file
Check the contents of the csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file; in the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify exactly which columns are included in your parquet file.