Spark: PartitionBy, change output file name


Question

Currently, when I use partitionBy to write to HDFS: DF.write.partitionBy("id")

I will get an output structure looking like (which is the default behaviour):

../id=1/

../id=2/

../id=3/

I would like a structure looking like:

../a/

../b/

../c/

such that:

if id = 1, then a
if id = 2, then b

..and so on.

Is there a way to change the filename output? If not, what is the best way to do this?

Answer

You won't be able to use Spark's partitionBy to achieve this.

Instead, you have to break your DataFrame into its component partitions and save them one by one, like so:

base = ord('a') - 1
for id in range(1, 4):
    # Write each id's rows to its own letter-named directory: 1 -> a, 2 -> b, ...
    DF.filter(DF['id'] == id).write.save("..." + chr(base + id))
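If the set of ids is not known up front, a variation on the loop above could collect the distinct ids from the DataFrame first. A minimal sketch, assuming integer ids in the range 1..26; `id_to_dir` is a hypothetical helper, not part of Spark:

```python
def id_to_dir(id_):
    """Hypothetical helper: map 1 -> 'a', 2 -> 'b', ... (assumes ids 1..26)."""
    return chr(ord('a') - 1 + id_)

# Collecting the distinct ids and writing each partition might then look like:
# ids = [row['id'] for row in DF.select('id').distinct().collect()]
# for id_ in ids:
#     DF.filter(DF['id'] == id_).write.save("..." + id_to_dir(id_))
print(id_to_dir(1), id_to_dir(2))  # a b
```

Note that this issues one Spark job per id, so it is only practical when the number of distinct ids is small.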

Alternatively, you can write the entire dataframe using Spark's partitionBy facility, and then manually rename the partitions using HDFS APIs.
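The rename step could be sketched as follows. The directory-name mapping is plain Python and testable; the HDFS calls (reached through PySpark's JVM gateway, a commonly used pattern) are shown as an untested comment since they need a live cluster. `partition_dir_to_name` and the `hdfs:///output/` path are hypothetical, assuming ids 1..26:

```python
import re

def partition_dir_to_name(dirname):
    """Map a partition directory like 'id=2' to its letter name 'b'.
    Hypothetical helper; assumes integer ids 1..26."""
    id_ = int(re.match(r"id=(\d+)", dirname).group(1))
    return chr(ord('a') - 1 + id_)

# Renaming via the Hadoop FileSystem API from PySpark (untested sketch):
#
# jvm = spark._jvm
# fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# base = "hdfs:///output/"  # assumed output location
# for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(base)):
#     name = status.getPath().getName()  # e.g. "id=2"
#     if name.startswith("id="):
#         fs.rename(status.getPath(),
#                   jvm.org.apache.hadoop.fs.Path(base + partition_dir_to_name(name)))
print(partition_dir_to_name("id=1"))  # a
```

This keeps the single partitionBy write (one Spark job) and does the cheap metadata-only renames afterwards, which scales better than one filtered write per id.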
