Specifying the output file name in Apache Spark
Question
I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?
In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.
PS: I did try the saveAsTextFile() method. For example:
import re

lines = sc.textFile(filesToProcessStr)
counts = lines.flatMap(lambda x: re.split(r'[\s&]', x.strip()))
counts.saveAsTextFile("/user/itsjeevs/mymr-output")
This will create the same part-00000 files.
[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r----- 2 itsjeevs itsjeevs 0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r-- 2 itsjeevs itsjeevs 101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r-- 2 itsjeevs itsjeevs 17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001
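A common workaround (an assumption about typical practice, not something stated in the question) is to let Spark write the part-NNNNN files as usual and rename them afterwards, e.g. with the HDFS FileSystem API or `hadoop fs -mv`. The helper below, `rename_part_files`, is a hypothetical local sketch of that idea using plain `os.rename`:

```python
import os
import tempfile

def rename_part_files(output_dir, new_name_fn):
    """Rename Hadoop-style part-NNNNN files using a caller-supplied
    naming function. Hypothetical helper, not a Spark or Hadoop API."""
    renamed = []
    for fname in sorted(os.listdir(output_dir)):
        if fname.startswith("part-"):
            index = int(fname.split("-")[1])       # "part-00001" -> 1
            new_name = new_name_fn(index)
            os.rename(os.path.join(output_dir, fname),
                      os.path.join(output_dir, new_name))
            renamed.append(new_name)
    return renamed

# Simulate a Spark output directory with two part files and a _SUCCESS marker.
out = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(out, "part-%05d" % i), "w") as f:
        f.write("data\n")
open(os.path.join(out, "_SUCCESS"), "w").close()

result = rename_part_files(out, lambda i: "mymr-output-%d.txt" % i)
print(result)  # → ['mymr-output-0.txt', 'mymr-output-1.txt']
```

On a real cluster the same loop would go through `hadoop fs -mv` or the JVM FileSystem API rather than `os.rename`, but the naming logic is identical.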
Edit
Recently read the article (http://databricks.com/blog/2014/09/17/spark-1-1-bringing-hadoop-inputoutput-formats-to-pyspark.html) which would make life much easier for Spark users.
Answer
Spark is also using Hadoop under the hood, so you can probably get what you want. This is how saveAsTextFile is implemented:
def saveAsTextFile(path: String) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
You could pass in a customized OutputFormat to saveAsHadoopFile. I have no idea how to do that from Python though. Sorry for the incomplete answer.
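For what it's worth, PySpark does expose saveAsHadoopFile on pair RDDs, and it accepts a Hadoop OutputFormat by fully qualified class name. The custom naming logic still has to live in a JVM class (for instance a subclass of MultipleTextOutputFormat overriding generateFileNameForKeyValue) that is already on the executor classpath; `com.example.CustomNameOutputFormat` below is a hypothetical placeholder, not a real class. A sketch under those assumptions, which needs a live SparkContext to actually run:

```python
# Hedged sketch: assumes a JVM-side OutputFormat subclass is on the classpath.
# `com.example.CustomNameOutputFormat` is a hypothetical placeholder class;
# the naming logic cannot be written in Python itself.
pairs = counts.map(lambda line: (None, line))  # saveAsHadoopFile expects (k, v) pairs
pairs.saveAsHadoopFile(
    "/user/itsjeevs/mymr-output",
    outputFormatClass="com.example.CustomNameOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text")
```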