Drop partition columns when writing parquet in pyspark
Question
I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
import pyspark.sql.functions as f

df = (df
      .withColumn('year', f.year(f.col('date_col')))
      .withColumn('month', f.month(f.col('date_col')))
      .withColumn('day', f.dayofmonth(f.col('date_col'))))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.
Answer
Spark/Hive won't write the year, month, day columns into your parquet files, because they are already in the partitionBy clause.
Example:
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv file
Check the contents of the csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file; in the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify exactly which columns are included in your parquet file.