pyspark how to load compressed snappy file

Question

I have compressed a file using python-snappy and put it in my HDFS store. I am now trying to read it in like so, but I get the following traceback. I can't find an example of how to read the file in so I can process it. I can read the uncompressed text version fine. Should I be using sc.sequenceFile? Thanks!

I first compressed the file and pushed it to hdfs

python -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
hdfs dfs -put gene_regions.vcf.snappy /

I then added the following to spark-env.sh
export SPARK_EXECUTOR_MEMORY=16G                                                
export HADOOP_HOME=/usr/local/hadoop                                            

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native             
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native                 
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native           
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar

I then launch my Spark master and slave, and finally my IPython notebook, where I execute the code below.

a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
a_file.first()


ValueError                                Traceback (most recent call last)
in <module>()
----> 1 a_file.first()

/home/user/Software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in first(self)
   1244         if rs:
   1245             return rs[0]
-> 1246         raise ValueError("RDD is empty")
   1247
   1248     def isEmpty(self):

ValueError: RDD is empty

Working code for the (uncompressed) text file:
a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf")
a_file.first()

output: u'##fileformat=VCFv4.1'

Answer

The issue here is that python-snappy is not compatible with Hadoop's snappy codec, which is what Spark will use to read the data when it sees a ".snappy" suffix. They are based on the same underlying algorithm, but they aren't compatible: you can't compress with one and decompress with the other.
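
To make the mismatch concrete, here is a minimal local sketch (no Spark involved; the sample bytes are illustrative) of how python-snappy's own two formats already differ from each other, with Hadoop's SnappyCodec container being a third, incompatible layout:

import snappy

payload = b"##fileformat=VCFv4.1\n" * 100

# One-shot "raw" snappy, python-snappy's default API.
raw = snappy.compress(payload)
assert snappy.decompress(raw) == payload

# The snappy framing format, produced by the stream API (and by the
# python -m snappy command line used above).
framed = snappy.StreamCompressor().add_chunk(payload)

# The raw decompressor cannot read the framed bytes; Hadoop's SnappyCodec
# wraps blocks in yet another, length-prefixed container.
try:
    snappy.decompress(framed)
except snappy.UncompressError:
    print("raw snappy cannot decode framing-format data")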

You can make this work either by writing your data out to snappy with Spark or Hadoop in the first place, or by having Spark read your data as binary blobs and then manually invoking the python-snappy decompression yourself (see binaryFiles here: http://spark.apache.org/docs/latest/api/python/pyspark.html). The binary-blob approach is a bit more brittle, because each input file has to fit in memory in its entirety, but if your data is small enough it will work; a sketch follows below.
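
A rough sketch of the binary-blob route, assuming the file was written by python -m snappy (the framing format) and that each file fits in executor memory; if you compressed with the one-shot snappy.compress() instead, swap in snappy.decompress():

# Each record is (path, contents), with the whole file read as one byte string.
blobs = sc.binaryFiles("hdfs://master:54310/gene_regions.vcf.snappy")

def decompress_lines(kv):
    import snappy  # python-snappy must be installed on every worker
    path, payload = kv
    # Streaming decompressor for the snappy framing format.
    data = snappy.StreamDecompressor().decompress(payload)
    return data.decode("utf-8").splitlines()

a_file = blobs.flatMap(decompress_lines)
a_file.first()

For the write-side fix, depending on your Spark version PySpark can emit Hadoop-compatible snappy directly, e.g. rdd.saveAsTextFile(path, compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec"), provided the native snappy libraries are on the library path as configured in spark-env.sh above.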
