How to make saveAsTextFile NOT split output into multiple files?


Problem description

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing one parameter (the output path) to it.

val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")

  1. Does the number of output files correspond to the number of reducers it uses?
  2. Does this mean the output is compressed?
  3. I know I can combine the output together with bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but it doesn't say much about this.

Accepted answer

The reason it saves it as multiple files is that the computation is distributed. If the output is small enough that you think it can fit on one machine, then you can end your program with

val arr = year.collect()

and then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy and make everything go to one partition, though that isn't advisable because you won't get any parallelization.
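For illustration, a minimal sketch of the collect-and-write approach, assuming the collected data fits in driver memory; the local output path "year.txt" is hypothetical:

// Collect the RDD to the driver and write a single local file.
// Assumes the result fits in driver memory.
import java.io.PrintWriter

val arr = year.collect()
val out = new PrintWriter("year.txt")
try {
  // Mimic saveAsTextFile's tuple formatting, e.g. "(count,year)"
  arr.foreach { case (count, key) => out.println(s"($count,$key)") }
} finally {
  out.close()
}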

If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means: do the computation, then coalesce to one partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
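A short sketch of both variants described above; the output paths are illustrative:

// Coalesce to a single partition so saveAsTextFile writes one part file.
// shuffle = true forces a full shuffle, so upstream stages keep their parallelism.
year.coalesce(1, shuffle = true).saveAsTextFile("year-single")

// Equivalent: repartition(1) is coalesce with shuffle = true.
year.repartition(1).saveAsTextFile("year-single-repartitioned")

Note that either way, the final write itself runs in a single task, so this only makes sense when the output is small.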

