Saving RDD as sequence file in pyspark


Problem description

I am able to run this script to save the file in text format, but when I try to run saveAsSequenceFile it errors out. If anyone has an idea about how to save an RDD as a sequence file, please let me know the process. I tried looking for a solution in "Learning Spark" as well as the official Spark documentation.

This runs successfully

dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsTextFile("/user/cloudera/pyspark/departments")

This fails

dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq")

Error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsSequenceFile. : org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used

Here is the data:

2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
8,TESTING
8000,TESTING

Solution

Sequence files are used to store key-value pairs, so you cannot simply store an RDD[String]. Given your data, I guess you're looking for something like this:

rdd = sc.parallelize([
    "2,Fitness", "3,Footwear", "4,Apparel"
])
rdd.map(lambda x: tuple(x.split(",", 1))).saveAsSequenceFile("testSeq")

If you want to keep the whole strings, just use None keys:

rdd.map(lambda x: (None, x)).saveAsSequenceFile("testSeqNone")
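The conversion applied in the map step above can be sanity-checked without a Spark cluster. Here is a plain-Python sketch of the same logic; note that the second argument to split (maxsplit=1) means only the first comma splits the record, so any commas inside the value itself would be preserved:

```python
# Plain-Python sketch of the record-to-pair conversion used in the
# map step above; runnable without Spark.
def to_pair(line):
    # split(",", 1) splits on the first comma only, so any commas
    # inside the value are kept as part of the value.
    return tuple(line.split(",", 1))

pairs = [to_pair(x) for x in ["2,Fitness", "7,Fan Shop", "8000,TESTING"]]
print(pairs)  # [('2', 'Fitness'), ('7', 'Fan Shop'), ('8000', 'TESTING')]
```

When the saved file is read back with sc.sequenceFile("testSeq"), the keys and values should come back as strings, since PySpark writes Python strings out as Hadoop Text writables.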
