Handle corrupted gzip files in Hadoop / Hive


Question



I have daily folders on HDFS with a lot of tar.gz files, each containing a large number of text files. A number of those tar.gz files were found to be corrupted, and they cause hive/mapreduce jobs to crash with an "unexpected end of stream" error when processing them.

I identified a few of those and tested them with tar -zxvf. They do exit with an error, but still extract a decent number of files before that happens.

Is there a way to stop hive/mapreduce jobs from simply crashing when a tar.gz file is corrupted? I've tested some error-skipping and failure-tolerance parameters (roughly as in the sketch after this list), such as

mapred.skip.attempts.to.start.skipping,
mapred.skip.map.max.skip.records,
mapred.skip.mode.enabled,
mapred.map.max.attempts,
mapred.max.map.failures.percent,
mapreduce.map.failures.maxpercent.
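
(For reference, a minimal sketch of how such parameters might be set from a job driver's Configuration. The property names are the ones listed above; the values and the class name are purely illustrative assumptions, and the effective names differ between Hadoop 1.x and 2.x.)

import org.apache.hadoop.conf.Configuration;

// Illustrative only: enable record skipping and loosen failure tolerance.
// Values are assumptions; check which property names your Hadoop version honors.
public class SkipModeSettings {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.skip.mode.enabled", true);        // turn skip mode on
        conf.setInt("mapred.skip.attempts.to.start.skipping", 2); // start skipping after 2 failed attempts
        conf.setLong("mapred.skip.map.max.skip.records", 1L);     // max records dropped around a bad record
        conf.setInt("mapred.map.max.attempts", 4);                // retries per map task
        conf.setInt("mapreduce.map.failures.maxpercent", 5);      // tolerate up to 5% failed map tasks
        return conf;
    }
}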

In a small number of cases this helped get a complete folder processed without crashing, but mostly it caused the job to hang and never finish at all.

Unzipping every single file outside Hadoop just to recompress them afterwards (to get clean gzip files) and then upload them to HDFS again would be a painful process (because of the extra steps and the large volume of data this would generate).

Is there a cleaner / more elegant solution that someone has found?

Any help is appreciated.

Recommended answer

I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I ended up solving it by writing my own RecordReader which would catch IOExceptions, log the name of the file that had a problem, and then gracefully discard that file and move on to the next one.

I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
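
The write-up above has the actual code; below is only a minimal sketch of the same idea, assuming the new org.apache.hadoop.mapreduce API and line-oriented text input. The class name QuietLineRecordReader and the error handling details are illustrative assumptions, not the code from the linked post.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Wraps the standard LineRecordReader and swallows IOExceptions (such as the
// "unexpected end of stream" thrown on a truncated gzip), so the file is
// dropped gracefully instead of killing the task.
public class QuietLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private String fileName;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        fileName = ((FileSplit) split).getPath().toString();
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        try {
            return delegate.nextKeyValue();
        } catch (IOException e) {
            // Corrupt stream: log the file and pretend it simply ended, keeping
            // whatever records were already read and moving on to the next split.
            System.err.println("Skipping corrupt file " + fileName + ": " + e);
            return false;
        }
    }

    @Override
    public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

To actually use something like this, you would also need a small FileInputFormat (or TextInputFormat) subclass that returns this reader, and you would set that class as the job's input format, or as the table's INPUTFORMAT in Hive.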

