Compression codec detection in Hadoop from the command line


Question

Is there any simple way to find out which codec was used to compress a file in Hadoop?

Do I need to write a Java program, or add the file to Hive so that I can use describe formatted table?

Answer

One way is to download the file locally (using the hdfs dfs -get command) and then follow the usual procedure for detecting the compression format of local files.
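For example, here is a minimal sketch (the HDFS path is hypothetical): pull the file down and let the standard file utility identify common formats such as gzip and bzip2 by their magic bytes.

hdfs dfs -get /user/me/mystery.part /tmp/mystery.part
file /tmp/mystery.part

If file only reports "data", you can still inspect the leading bytes yourself (gzip files start with the bytes 1f 8b, bzip2 files with "BZh"):

head -c 4 /tmp/mystery.part | xxd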

This should work well for files compressed outside of Hadoop. For files generated within Hadoop it works only in a limited number of cases, e.g. text files compressed with Gzip.
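For the Gzip case you can even check the magic bytes without downloading the whole file, by streaming just the first bytes from HDFS (the path is hypothetical; an output of 1f 8b indicates gzip, and head exiting early may trigger a harmless broken-pipe message from the HDFS client):

hdfs dfs -cat /user/me/part-00000 | head -c 2 | xxd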

Files compressed within Hadoop are likely to be in so-called "container formats", e.g. Avro, Sequence Files, Parquet, etc. That means the file is not compressed as a whole; only the chunks of data inside it are. Hive's describe formatted table command that you mention can indeed help you figure out the input format of the underlying files.
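As a sketch (the table name my_table is hypothetical), you can run it from the shell and look for the InputFormat and SerDe Library fields in the output:

hive -e 'DESCRIBE FORMATTED my_table;'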

Once you know the file format, refer to its documentation or source code for guidance on codec detection. Some file formats even ship with command-line tools for inspecting a file's metadata, which reveals the compression codec. Some examples:

Avro:

hadoop jar /path/to/avro-tools.jar getmeta FILE_LOCATION_ON_HDFS --key 'avro.codec'
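The avro.codec value printed here is typically deflate or snappy, or null for an uncompressed file.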

Parquet:

hadoop jar /path/to/parquet-tools.jar meta FILE_LOCATION_ON_HDFS
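The meta output lists per-column details for each row group, including the compression codec used for each column chunk (e.g. SNAPPY, GZIP, or UNCOMPRESSED).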

