How to make saveAsTextFile NOT split output into multiple files?


Problem description


When using Scala in Spark, whenever I dump a result out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (the path) to it:

val year = sc.textFile("apat63_99.txt")
  .map(_.split(",")(1))
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
year.saveAsTextFile("year")


Does the number of output files correspond to the number of reducers it uses? Does this mean the output is compressed?


I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
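For reference, the bash-side merge mentioned above amounts to concatenating the part files Spark writes under the output directory. The sketch below simulates two part files (the `year/` directory and its contents are illustrative, not real job output) and then merges them:

```shell
# Spark writes one part-NNNNN file per partition under the output
# directory ("year" here). Simulate two such parts for illustration:
mkdir -p year
printf '(1963,1)\n' > year/part-00000
printf '(1964,2)\n' > year/part-00001

# Merge every part file into a single text file:
cat year/part-* > year.txt
```

On HDFS the equivalent is `hadoop fs -getmerge year year.txt`.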

Recommended answer


The reason it saves it as multiple files is that the computation is distributed. If the output is small enough that you think it can fit on one machine, then you can end your program with

val arr = year.collect()
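A minimal sketch of that approach, writing the collected array out with plain JVM I/O on the driver (the output path `year.txt` and the use of `java.io.PrintWriter` are illustrative choices, not part of the original answer):

```scala
import java.io.PrintWriter

// Bring the (assumed small) result back to the driver as a local array.
val arr = year.collect()

// Write it out with ordinary file I/O; "year.txt" is a hypothetical path.
val writer = new PrintWriter("year.txt")
try {
  arr.foreach(pair => writer.println(pair))
} finally {
  writer.close()
}
```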


and then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy and make it so everything goes to one partition, though that isn't advisable because you won't get any parallelization.


If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means: do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
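Continuing the word-count pipeline from the question, the single-file variants might look like this (a sketch; the output paths are illustrative):

```scala
// Collapse to one partition before saving. Passing shuffle = true keeps
// the upstream stages parallel and only merges at the very end.
year.coalesce(1, shuffle = true).saveAsTextFile("year-single")

// Equivalent: repartition(1) is coalesce(1, shuffle = true) under the hood.
year.repartition(1).saveAsTextFile("year-single-repartitioned")
```

Either way the output directory still contains a part-00000 file; there is just exactly one of them.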

