Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?


Problem description


I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.

I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.

SparkContext seems to have a few file-related methods but they all seem to be inputs not outputs.

How do I do this?

Solution

Thanks to marios and kostya, but there are only a few steps to writing a text file into HDFS from Spark:

import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop configuration is accessible from the SparkContext.
val fs = FileSystem.get(sparkContext.hadoopConfiguration)

// An output file can be created from the file system.
val output = fs.create(new Path(filename))

// But a BufferedOutputStream must be used to write an actual text file.
val os = new BufferedOutputStream(output)

os.write("Hello World".getBytes("UTF-8"))

os.close()

Note that FSDataOutputStream, which has been suggested, is a Java serialized object output stream, not a text output stream. Its writeUTF method appears to write plain text, but it actually produces a binary serialization format that includes extra bytes.
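Putting the steps above together, the answer's snippet can be wrapped in a small reusable helper. This is only a sketch: the method name writeTextToHdfs and the use of java.io.PrintWriter over an explicit UTF-8 OutputStreamWriter are my additions, not part of the original answer, and running it requires the Hadoop client libraries on the classpath.

```scala
import java.io.{BufferedOutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Sketch: write a plain-text string to a file on the cluster's default
// file system (HDFS on a cluster, the local FS when running locally).
def writeTextToHdfs(sc: SparkContext, filename: String, text: String): Unit = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val os = new BufferedOutputStream(fs.create(new Path(filename)))
  // Wrap the byte stream in a character writer so the output is real
  // text, not writeUTF's binary serialization format.
  val writer = new PrintWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8))
  try {
    writer.write(text)
  } finally {
    writer.close() // also closes the underlying streams
  }
}
```

Closing the writer in a finally block ensures the HDFS lease is released even if the write fails partway through.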

