Compressing a sequence file in Spark?

Question

I'm trying to save an RDD as a compressed sequence file. I can save an uncompressed file by calling:

counts.saveAsSequenceFile(output)

where counts is my RDD of (IntWritable, Text) pairs. However, I haven't managed to compress the output. I tried several codecs and always got a type mismatch error:

counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
<console>:21: error: type mismatch;
 found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
              counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])

 counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
<console>:21: error: type mismatch;
 found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
              counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])

It doesn't work even for Gzip:

 counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
<console>:21: error: type mismatch;
 found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
              counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])

Could you please suggest a solution? Also, I couldn't find how to specify compression parameters (e.g., the compression type for Snappy).

Recommended answer

The signature of saveAsSequenceFile is def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None). The codec argument must therefore be an Option[Class[_ <: CompressionCodec]], not a bare Class, so wrap the codec class in Some. For example:

counts.saveAsSequenceFile(output, Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
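
To see why the wrapper is needed without a Spark or Hadoop installation, here is a minimal pure-Scala sketch of the same parameter shape. The names Codec, GzipLikeCodec, and OptionalCodecSketch.save are hypothetical stand-ins invented for illustration; they are not Hadoop or Spark classes, and the sketch does not write any files.

```scala
// Hypothetical stand-ins for Hadoop's CompressionCodec hierarchy (NOT the real classes).
trait Codec
class GzipLikeCodec extends Codec

object OptionalCodecSketch {
  // Same parameter shape as saveAsSequenceFile: the codec is optional and defaults to None.
  def save(path: String, codec: Option[Class[_ <: Codec]] = None): String =
    codec match {
      case Some(_) => s"$path (compressed)"
      case None    => s"$path (uncompressed)"
    }

  def main(args: Array[String]): Unit = {
    println(save("out"))                               // no codec supplied
    println(save("out", Some(classOf[GzipLikeCodec]))) // codec class wrapped in Some
    // save("out", classOf[GzipLikeCodec])             // does not compile: a bare Class is not an Option
  }
}
```

The commented-out call is the analogue of the failing calls in the question: the compiler finds a Class where an Option of a Class is required.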

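Reading the type mismatch error carefully points to this fix: the compiler found a Class but required an Option[Class[_ <: CompressionCodec]]. Note also that org.apache.spark.io.SnappyCompressionCodec is Spark's own internal codec and does not implement Hadoop's CompressionCodec, so it would not type-check here even when wrapped in Some.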
