How to save RDD data into json files, not folders
Problem description
I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly I want to save the outputs, but I am mentioning it just in case).
The following code works well, but it saves folders with names like jsonFile-19-45-46.json, and then inside those folders it saves the files _SUCCESS and part-00000.
Is it possible to save each RDD[String] (these are JSON strings) into a JSON file, not a folder? I thought that repartition(1) would do the trick, but it didn't.
myDStream.foreachRDD { rdd =>
// datetimeString = ....
rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-"+datetimeString+".json")
}
Recommended answer
AFAIK there is no option to save the output as a single file. Spark is a distributed processing framework, and writing to a single file is not good practice there; instead, each partition writes its own file in the specified path.
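That said, if each batch is small enough to fit in driver memory, a common workaround is to call rdd.collect() on the driver and write the records to a single file with plain file I/O. Below is a minimal sketch of the write step only; the collect() call is replaced by a hard-coded Seq so the snippet stays self-contained, and the object name WriteBatch and the path are illustrative assumptions (writing to S3 from the driver would additionally require an S3 client).

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

object WriteBatch {
  // Write one JSON string per line into a single file.
  // In the real streaming job, `records` would come from rdd.collect()
  // inside myDStream.foreachRDD { rdd => ... }.
  def writeBatch(records: Seq[String], path: String): Unit = {
    // Files.write with an Iterable writes each element as its own line.
    Files.write(Paths.get(path), records.asJava)
  }

  def main(args: Array[String]): Unit = {
    val records = Seq("""{"a":1}""", """{"b":2}""")
    val out = Files.createTempDirectory("batch").resolve("jsonFile.json")
    writeBatch(records, out.toString)
    Files.readAllLines(out).asScala.foreach(println)
  }
}
```

Note that collect() pulls the whole batch onto the driver, so this only scales while each micro-batch stays small.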
We can only pass the output directory where we want to save the data. The OutputWriter will create one or more files (depending on the number of partitions) inside the specified path, with the part- file name prefix.