Compressing sequence file in Spark?
Question
I'm trying to save an RDD as a compressed sequence file. I'm able to save the non-compressed file by calling:
counts.saveAsSequenceFile(output)
where counts is my RDD of (IntWritable, Text). However, I didn't manage to compress the output. I tried several configurations and always got an exception:
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
<console>:21: error: type mismatch;
found : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
<console>:21: error: type mismatch;
found : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
and it doesn't work even for Gzip:
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
<console>:21: error: type mismatch;
found : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
Could you please suggest a solution? Also, I couldn't find how to specify compression parameters (i.e. the compression type for Snappy).
Answer
The signature of saveAsSequenceFile is def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None). You need to pass an Option[Class[_ <: CompressionCodec]] as codec. E.g.,
counts.saveAsSequenceFile(output, Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
If you read the type mismatch error message carefully, you can correct this yourself: the method expects the codec class wrapped in an Option, not the bare Class.
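A fuller sketch of the fix, assuming a live SparkContext `sc` (e.g. in spark-shell) and a hypothetical output path "output-gzip". It also addresses the follow-up question about compression parameters: saveAsSequenceFile goes through the old Hadoop API, so the record- vs block-level compression type can be set on the Hadoop configuration before saving; the property name "mapred.output.compression.type" is the old-API one and is worth verifying against your Hadoop version.

```scala
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.GzipCodec

// Optional: choose record- vs block-level compression for the sequence file.
// BLOCK usually compresses better than RECORD.
sc.hadoopConfiguration.set("mapred.output.compression.type",
                           CompressionType.BLOCK.toString)

// Spark converts Int -> IntWritable and String -> Text automatically
// when saving, so a plain (Int, String) pair RDD works here.
val counts = sc.parallelize(Seq((1, "one"), (2, "two")))

// Wrap the codec class in Some(...) so it matches the
// Option[Class[_ <: CompressionCodec]] parameter.
counts.saveAsSequenceFile("output-gzip", Some(classOf[GzipCodec]))
```

The same call works with SnappyCodec in place of GzipCodec, provided the native Snappy libraries are available on the cluster.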