pyspark how to load compressed snappy file

Question

I have compressed a file using python-snappy and put it in my HDFS store. I am now trying to read it in like so, but I get the following traceback. I can't find an example of how to read the file in so I can process it. I can read the uncompressed text version fine. Should I be using sc.sequenceFile? Thanks!

I first compressed the file and pushed it to hdfs

python -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
hdfs dfs -put gene_regions.vcf.snappy /

I then added the following to spark-env.sh
export SPARK_EXECUTOR_MEMORY=16G                                                
export HADOOP_HOME=/usr/local/hadoop                                            

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native             
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native                 
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native           
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar

I then launch my Spark master and slave, and finally my IPython notebook, where I execute the code below.

a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
a_file.first()


ValueError                                Traceback (most recent call last)
in <module>()
----> 1 a_file.first()

/home/user/Software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in first(self)
   1244         if rs:
   1245             return rs[0]
-> 1246         raise ValueError("RDD is empty")
   1247
   1248     def isEmpty(self):

ValueError: RDD is empty

Working code for the (uncompressed) text file:
a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf")
a_file.first()

output: u'##fileformat=VCFv4.1'

Answer

The issue here is that python-snappy is not compatible with Hadoop's snappy codec, which is what Spark will use to read the data when it sees a ".snappy" suffix. They are based on the same underlying algorithm, but they aren't compatible: you can't compress with one and decompress with the other.
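
To make the mismatch concrete, here is a minimal local sketch (no Spark involved; the sample bytes are illustrative) of how python-snappy's own two formats already differ from each other, with Hadoop's SnappyCodec container being a third, incompatible layout:

import snappy

payload = b"##fileformat=VCFv4.1\n" * 100

# One-shot "raw" snappy, python-snappy's default API.
raw = snappy.compress(payload)
assert snappy.decompress(raw) == payload

# The snappy framing format, produced by the stream API (and by the
# python -m snappy command line used above).
framed = snappy.StreamCompressor().add_chunk(payload)

# The raw decompressor cannot read the framed bytes; Hadoop's SnappyCodec
# wraps blocks in yet another, length-prefixed container.
try:
    snappy.decompress(framed)
except snappy.UncompressError:
    print("raw snappy cannot decode framing-format data")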

You can make this work either by writing your data out to snappy with Spark or Hadoop in the first place, or by having Spark read your data as binary blobs and then manually invoking the python-snappy decompression yourself (see binaryFiles here: http://spark.apache.org/docs/latest/api/python/pyspark.html). The binary-blob approach is a bit more brittle, because each input file has to fit in memory in its entirety, but if your data is small enough it will work; a sketch follows below.
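
A rough sketch of the binary-blob route, assuming the file was written by python -m snappy (the framing format) and that each file fits in executor memory; if you compressed with the one-shot snappy.compress() instead, swap in snappy.decompress():

# Each record is (path, contents), with the whole file read as one byte string.
blobs = sc.binaryFiles("hdfs://master:54310/gene_regions.vcf.snappy")

def decompress_lines(kv):
    import snappy  # python-snappy must be installed on every worker
    path, payload = kv
    # Streaming decompressor for the snappy framing format.
    data = snappy.StreamDecompressor().decompress(payload)
    return data.decode("utf-8").splitlines()

a_file = blobs.flatMap(decompress_lines)
a_file.first()

For the write-side fix, depending on your Spark version PySpark can emit Hadoop-compatible snappy directly, e.g. rdd.saveAsTextFile(path, compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec"), provided the native snappy libraries are on the library path as configured in spark-env.sh above.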
