How to save RDD data into json files, not folders


Question

I am receiving streaming data myDStream (DStream[String]) that I want to save in S3 (for this question it doesn't really matter where exactly I want to save the output, but I am mentioning it just in case).

The following code works well, but it saves folders with names like jsonFile-19-45-46.json, and inside each folder it saves the files _SUCCESS and part-00000.

Is it possible to save each RDD[String] (these are JSON strings) into a JSON file rather than a folder? I thought repartition(1) would do the trick, but it didn't.

    myDStream.foreachRDD { rdd =>
      // datetimeString = ....
      rdd.repartition(1).saveAsTextFile("s3n://mybucket/keys/jsonFile-" + datetimeString + ".json")
    }

Answer

AFAIK there is no option to save the output as a single file. Spark is a distributed processing framework, and writing to a single file is not good practice; instead, each partition writes its own file under the specified path.
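That said, if a single output file is required, one common workaround is to let Spark write the part- files to a temporary directory and then merge them into one file with the Hadoop FileSystem API. This is only a sketch: it assumes Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3), and the bucket and path names are placeholders, not ones from the question.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  // Hypothetical paths: a scratch directory for Spark's part- files,
  // and the final single-file destination.
  val tmpDir  = "s3n://mybucket/tmp/jsonFile-" + datetimeString
  val dstFile = "s3n://mybucket/keys/jsonFile-" + datetimeString + ".json"

  // Spark still writes a directory of part- files here.
  rdd.saveAsTextFile(tmpDir)

  // Merge all part- files from tmpDir into one destination file,
  // deleting the temporary directory afterwards.
  val conf = new Configuration()
  val fs   = FileSystem.get(new java.net.URI(tmpDir), conf)
  FileUtil.copyMerge(fs, new Path(tmpDir), fs, new Path(dstFile),
    /* deleteSource = */ true, conf, null)
}
```

The merge happens on a single machine, so this is only reasonable when each batch is modest in size.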

We can only pass the output directory where we want to save the data. The OutputWriter will create one or more files inside the specified path (depending on the number of partitions), with the part- file-name prefix.
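If each streaming batch is small, another sketch is to collect the RDD to the driver and write one file there. This pulls the whole batch into driver memory, so it is unsafe for large batches, and the file name below is a placeholder:

```scala
import java.io.PrintWriter

myDStream.foreachRDD { rdd =>
  // datetimeString = ....
  val lines = rdd.collect()  // brings the entire batch to the driver
  val writer = new PrintWriter("jsonFile-" + datetimeString + ".json")
  try lines.foreach(writer.println) finally writer.close()
  // Uploading the resulting local file to S3 would additionally
  // require an S3 client (e.g. the AWS SDK).
}
```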

