How to merge all part files in a folder created by a Spark DataFrame and rename the result after the folder name in Scala

Views: 27 | Published: 2021/11/14 23:03:36 | Tags: scala apache-spark hdfs spark-dataframe hadoop2

Problem description

Hi, I have the output of my Spark DataFrame, which creates a folder structure containing many part files. Now I have to merge all the part files inside each folder and rename the resulting single file after the folder path.

This is how I partition the output:

df.write.partitionBy("DataPartition","PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")

It creates a folder structure like this:

hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz
hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00002-87a61115-92c9-4926-a803-b46315e55a08.c001.csv.gz

I have to create a final file like this:

hdfs:///user/zeppelin/FinancialLineItem/output/Japan.1971.currenttime.csv.gz

There are no part files left here: both 001 and 002 are merged into one file.
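The target name can be derived from the partition directory itself. Below is a minimal sketch, assuming a `yyyyMMddHHmmss` timestamp layout (the question only says `currenttime`); `MergedFileName`, `partitionValues`, and `targetName` are hypothetical names introduced for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object MergedFileName {
  // Assumption: the question does not specify the "currenttime" layout.
  private val stampFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  /** Extracts the values of the key=value segments in a partition directory,
    * e.g. ".../output/DataPartition=Japan/PartitionYear=1971". */
  def partitionValues(partitionDir: String): Seq[String] =
    partitionDir.split('/').toSeq
      .filter(_.contains("="))
      .map(_.split("=", 2)(1))

  /** Joins the partition values and a timestamp with dots and appends ".csv.gz". */
  def targetName(partitionDir: String, now: LocalDateTime): String =
    (partitionValues(partitionDir) :+ now.format(stampFormat)).mkString(".") + ".csv.gz"
}
```

Passing the timestamp in as a parameter, rather than calling `LocalDateTime.now()` inside the helper, keeps the function deterministic and lets every partition of one run share the same stamp.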

My data is very large (300 GB gzip and 35 GB zipped), so coalesce(1) and repartition become very slow.

I have seen one solution here, "Write single CSV file using spark-csv", but I am not able to implement it. Please help me with it.
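One way to avoid coalesce(1) is to leave the write unchanged and concatenate the part files on HDFS afterwards. The sketch below uses only the Hadoop `FileSystem` API; `MergeParts` and `mergeParts` are hypothetical names. It relies on two assumptions that should be checked against the actual data: the gzip format allows several compressed members to be concatenated into one valid file, and downstream readers must tolerate that. Because the parts were written with `header` set to `true`, each part carries its own header row, so the merged file will contain repeated header lines unless they are handled separately.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object MergeParts {
  /** Concatenates every part-* file in partitionDir into target, in file-name order.
    * Bytes are copied as-is, so nothing is decompressed or recompressed. */
  def mergeParts(fs: FileSystem, partitionDir: Path, target: Path): Unit = {
    // List the inputs before creating the target, skipping _SUCCESS and other markers.
    val parts = fs.listStatus(partitionDir)
      .filter(s => s.isFile && s.getPath.getName.startsWith("part-"))
      .map(_.getPath)
      .sortBy(_.getName)

    val out = fs.create(target, false) // false: fail instead of overwriting
    try {
      parts.foreach { part =>
        val in = fs.open(part)
        // Last argument false: copyBytes must not close the shared output stream.
        try IOUtils.copyBytes(in, out, fs.getConf, false)
        finally in.close()
      }
    } finally out.close()
  }
}
```

In a Spark shell the file system could be obtained with `FileSystem.get(spark.sparkContext.hadoopConfiguration)`, and the original part files deleted once the merged file is verified. Hadoop 2.x also ships `FileUtil.copyMerge`, which performs a similar concatenation, but it was removed in Hadoop 3.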

Repartition throws an error:

error: value repartition is not a member of org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row]
       dfMainOutputFinalWithoutNull.write.repartition("DataPartition","StatementTypeCode")
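The error arises because `repartition` is a method of the DataFrame (`Dataset`), not of the `DataFrameWriter` returned by `.write`, so it has to be called before `.write`, and it takes `Column` expressions rather than plain strings. The following is a sketch under the assumption that the repartition columns should match the `partitionBy` columns from the write shown earlier; `PartitionedWrite` and `writePartitioned` are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object PartitionedWrite {
  /** Repartitions by the partition columns first, then writes gzip-compressed CSV. */
  def writePartitioned(df: DataFrame, outputPath: String): Unit =
    df.repartition(col("DataPartition"), col("PartitionYear")) // on the DataFrame, before .write
      .write
      .partitionBy("DataPartition", "PartitionYear")
      .format("csv")
      .option("nullValue", "")
      .option("header", "true")
      .option("codec", "gzip")
      .save(outputPath)
}
```

With this layout, all rows for one (DataPartition, PartitionYear) pair land in the same task, so each output folder would normally hold a single part file. The trade-off is that a very large partition is then written by one task, which may still be slow for data of the size described above.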
