Spark: PartitionBy, change output file name
Question
Currently, when I use partitionBy() to write to HDFS:
DF.write.partitionBy("id")
I get an output structure that looks like this (the default behaviour):
../id=1/
../id=2/
../id=3/
I would like a structure looking like:
../a/
../b/
../c/
Such that:
if id = 1, then a
if id = 2, then b
.. etc
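The intended id-to-directory mapping can be sketched in Python (id_to_dir is a hypothetical helper name, assuming ids start at 1):

```python
def id_to_dir(id_value):
    """Map a numeric partition id to a directory letter: 1 -> 'a', 2 -> 'b', ..."""
    # Assumes 1 <= id_value <= 26
    return chr(ord('a') + id_value - 1)
```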
Is there a way to change the filename output? If not, what is the best way to do this?
Answer
You won't be able to use Spark's partitionBy to achieve this.
Instead, you have to break your DataFrame into its component partitions and save them one by one, like so:
base = ord('a') - 1
for id in range(1, 4):
    DF.filter(DF['id'] == id).write.save("..." + chr(base + id))
Alternatively, you can write the entire dataframe using Spark's partitionBy facility, and then manually rename the partitions using HDFS APIs.
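A minimal sketch of that rename step, using the local filesystem for illustration (on HDFS you would use the Hadoop FileSystem API or `hdfs dfs -mv` instead; rename_partitions and the directory layout are assumptions, not part of the original answer):

```python
import os
import re

def rename_partitions(base_dir):
    # Rename directories of the form id=N to a single letter:
    # id=1 -> a, id=2 -> b, and so on (assumes 1 <= N <= 26).
    for name in os.listdir(base_dir):
        m = re.fullmatch(r"id=(\d+)", name)
        if m:
            letter = chr(ord('a') + int(m.group(1)) - 1)
            os.rename(os.path.join(base_dir, name),
                      os.path.join(base_dir, letter))
```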