Hive gzip file decompression

Question

I have loaded a bunch of .gz files into HDFS, and when I create a raw table on top of them I am seeing strange behavior when counting the number of rows. Comparing the result of count(*) on the gz table versus the uncompressed table shows a ~85% difference: the table backed by the gzip-compressed files has fewer records. Has anyone seen this?

CREATE EXTERNAL TABLE IF NOT EXISTS test_gz (
  col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED
  LINES TERMINATED BY '\n'
LOCATION '/data/raw/test_gz';

select count(*) from test_gz;   -- result: 1,123,456
select count(*) from test;      -- result: 7,720,109

Solution

I was able to resolve this issue. Somehow the gzip files were not getting fully decompressed in map/reduce jobs (Hive or custom Java map/reduce). The MapReduce job would read only about ~450 MB of the gzip file and write the data out to HDFS without ever reading the full 3.5 GB file. Strangely, there were no errors at all!
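
A quick way to reproduce the symptom outside Hadoop (this is not part of the original answer, just a sanity check assuming the suspect .gz has been copied to the local client) is to stream it through the JDK's own gzip decoder and count lines. If this also stops near the ~1.1M mark instead of ~7.7M, the archive itself is at fault rather than Hive:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class GzLineCount {
    public static void main(String[] args) throws IOException {
        // Fully decompress the archive given on the command line and count its lines.
        // A short count here mirrors the truncated read seen in the MapReduce job.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0]))))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println(args[0] + ": " + lines + " lines");
        }
    }
}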

Since the files had been compressed on another server, I decompressed them manually and re-compressed them on the Hadoop client server. After that, I uploaded the newly compressed 3.5 GB file to HDFS, and Hive was then able to count all the records, reading the whole file.
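
For reference, the decompress/re-compress round trip can also be done programmatically; this is a minimal sketch with hypothetical local input and output paths (the answer above just did it manually on the client). The clean file can then be pushed back under /data/raw/test_gz with hdfs dfs -put:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class Regzip {
    public static void main(String[] args) throws IOException {
        // Decompress args[0] and rewrite it as a single clean gzip stream at args[1].
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]));
             OutputStream out = new GZIPOutputStream(new FileOutputStream(args[1]))) {
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}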

Marcin
