How can I save an RDD into HDFS and later read it back?


Problem Description



I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?

Solution

It is possible.

On an RDD you have the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them back later.
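For example, the element (42L, "foo") ends up in the text file as the line (42,foo).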

Reading can be done with the textFile function on SparkContext, followed by a .map to strip the () and parse the values.

So:

Version 1:

rdd.saveAsTextFile("hdfs:///test1/")
// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { x =>
  // strip the surrounding parentheses, then split on the first comma
  // to recover the (Long, String) pair (one simple way to parse it)
  val s = x.stripPrefix("(").stripSuffix(")")
  val (l, r) = s.splitAt(s.indexOf(","))
  (l.toLong, r.drop(1))
}
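
Note that textFile reads the data back line by line, so this round trip assumes the String values contain no newlines; if they might, the object-file route of Version 2 is the safer choice.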

Version 2:

rdd.saveAsObjectFile("hdfs:///test1/")
// later, in another program - the tuples come back out of the box :)
// (objectFile is the read counterpart of saveAsObjectFile)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/part-*")
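
For completeness, here is a minimal, self-contained sketch of the Version 2 round trip; the object name, the app name, and the hdfs:///test1/ path are illustrative placeholders carried over from the snippets above:

import org.apache.spark.{SparkConf, SparkContext}

object RddRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-roundtrip"))

    // build a small RDD[(Long, String)] and persist it to HDFS
    val rdd = sc.parallelize(Seq((1L, "foo"), (2L, "bar")))
    rdd.saveAsObjectFile("hdfs:///test1/")

    // read it back; the element type is supplied explicitly
    val restored = sc.objectFile[(Long, String)]("hdfs:///test1/part-*")
    restored.collect().foreach(println)

    sc.stop()
  }
}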
