Spark saveAsTextFile() writes to multiple files instead of one

Question

I am currently using Spark and Scala on my laptop.

When I write an RDD to a file, the output is written to two files, "part-00000" and "part-00001". How can I force Spark/Scala to write to a single file?

My code is currently:

myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")

where the map removes the parentheses so that each line is written as a plain key,value pair.

Recommended answer

The "problem" is in fact a feature. It comes from how your RDD is partitioned: saveAsTextFile writes one part file per partition, so the output is split into n parts, where n is the number of partitions. To get a single file, change the number of partitions to one by calling repartition on your RDD. The documentation states:

repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

For example, this change should work:

myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")

As the documentation says, you can also use coalesce, which is the recommended option when you are reducing the number of partitions. However, reducing the number of partitions to one is generally considered a bad idea for anything but small outputs, because all of the data has to be brought to a single node and the write loses its parallelism.
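The coalesce variant mentioned above could look roughly like the following minimal sketch. The object name SinglePartFile, the local[2] master, the sample data standing in for myRDD, and the optional output-path argument are assumptions made for illustration; only coalesce, getNumPartitions and saveAsTextFile are the Spark calls being demonstrated.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SinglePartFile {
  def main(args: Array[String]): Unit = {
    // Assumption: output path is the first argument, else the placeholder from the question.
    val outputPath = args.headOption.getOrElse("/path/to/output")

    // Assumption: a local two-thread master, as on a laptop.
    val conf = new SparkConf().setAppName("single-part-file").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for myRDD: key/value pairs spread over 2 partitions.
      val myRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 2)

      // Same formatting as in the question: one "key,value" string per record.
      val lines = myRDD.map { case (k, v) => s"$k,$v" }
      println(s"partitions before: ${lines.getNumPartitions}")

      // coalesce(1) merges the existing partitions into one without a full shuffle.
      val single = lines.coalesce(1)
      println(s"partitions after: ${single.getNumPartitions}")

      // One partition means one part file is written into the output directory.
      single.saveAsTextFile(outputPath)
    } finally {
      sc.stop()
    }
  }
}
```

Note that the output path is still a directory: the single file inside it is named part-00000, and saveAsTextFile fails if the directory already exists.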
