Saving RDD as sequence file in pyspark
Problem description
I am able to run this script to save the file in text format, but when I try to run saveAsSequenceFile it errors out. If anyone has an idea about how to save the RDD as a sequence file, please let me know the process. I tried looking for a solution in "Learning Spark" as well as the official Spark documentation.
This runs successfully
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsTextFile("/user/cloudera/pyspark/departments")
This fails
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq")
Error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsSequenceFile.
: org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used
Here is the data:
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
8,TESTING
8000,TESTING
Solution
Sequence files are used to store key-value pairs, so you cannot simply store an RDD[String]. Given your data, I guess you're looking for something like this:
rdd = sc.parallelize(["2,Fitness", "3,Footwear", "4,Apparel"])
rdd.map(lambda x: tuple(x.split(",", 1))).saveAsSequenceFile("testSeq")
If you want to keep whole strings, just use None keys:
rdd.map(lambda x: (None, x)).saveAsSequenceFile("testSeqNone")
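To check what each mapping hands to saveAsSequenceFile, the same logic can be tried on plain Python strings first, with no Spark involved. This is only a sketch: the helper names to_pair and to_none_pair are made up for illustration, and the sample lines are taken from the data in the question.

```python
# Plain-Python sketch of the two mapping steps; no Spark needed.
# `to_pair` and `to_none_pair` are hypothetical helper names for illustration.

lines = ["2,Fitness", "3,Footwear", "7,Fan Shop", "8000,TESTING"]


def to_pair(line):
    # Split on the first comma only, so any later commas stay in the value.
    return tuple(line.split(",", 1))


def to_none_pair(line):
    # Keep the whole line as the value and use None as the key.
    return (None, line)


pairs = [to_pair(x) for x in lines]
none_pairs = [to_none_pair(x) for x in lines]

print(pairs)
print(none_pairs)
```

Note that the keys produced by the split are strings ("2", not 2); wrap the first element in int() if you want integer keys. Also, a line without a comma would give a one-element tuple, which is not a key-value pair, so it is worth checking the input for such lines before saving.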