How to load a tarball into Pig
Problem description
I have log files in a tarball (access.logs.tar.gz) that is loaded into my Hadoop cluster. Is there a way to load it directly into Pig without untarring it?
PigStorage will recognize that the file is compressed from its .gz extension (this is actually implemented in TextInputFormat, which PigTextInputFormat extends), but after decompression you are still dealing with a tar file. If you can handle the tar header lines that appear between the files in the archive, you can use PigStorage as is; otherwise you'll need to write your own extension of PigTextInputFormat that strips out the tar header between each file.
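To make the "tar headers between files" point concrete, here is a minimal Python sketch of the idea outside of Pig. It is not a PigTextInputFormat extension, only an illustration of what such a reader would have to do: walk the archive member by member and emit only the file contents, so the tar header blocks never reach the line parser. The function names and the sample file names (`a.log`, `b.log`) are hypothetical and chosen for illustration.

```python
import io
import tarfile


def iter_log_lines(tar_gz_bytes):
    """Yield text lines from every regular file in a .tar.gz, skipping tar headers."""
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            with archive.extractfile(member) as handle:
                for raw in handle:
                    yield raw.decode("utf-8", errors="replace").rstrip("\n")


def build_sample_tarball():
    """Build a tiny in-memory .tar.gz holding two hypothetical log files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        samples = [("a.log", b"GET /index\nGET /about\n"), ("b.log", b"POST /login\n")]
        for name, body in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return buffer.getvalue()


if __name__ == "__main__":
    for line in iter_log_lines(build_sample_tarball()):
        print(line)
```

If only the gzip layer is removed, as PigStorage does, the header block of each archive member is still in the stream and shows up as junk records. A custom input format would need to skip those blocks in the same way the loop above does.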