Handle corrupted gzip files in Hadoop / Hive
Question
I have daily folders with a lot of tar.gz files on HDFS containing a large number of text files.
A number of those tar.gz files turned out to be corrupted and cause hive/mapreduce jobs to crash with an "unexpected end of stream" error when processing them.
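Outside Hadoop, the same failure mode is easy to reproduce and check for with plain Python: a gzip stream cut off before its end-of-stream marker fails partway through decompression. This is a hedged sketch with throwaway temp files, not the real HDFS data:

```python
# Sketch: simulate a truncated gzip file and detect it with a full
# decompression pass. All paths are temporary demo files.
import gzip
import os
import tempfile
import zlib

def is_gzip_readable(path):
    """Return True iff the whole gzip stream decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: stream truncated before the end-of-stream marker;
        # OSError/BadGzipFile and zlib.error: header or data corruption.
        return False

tmp = tempfile.mkdtemp()

good = os.path.join(tmp, "good.gz")
with gzip.open(good, "wt") as f:
    f.write("some text line\n" * 1000)

# Simulate corruption: compress incompressible data, then chop the tail.
blob = os.path.join(tmp, "blob.gz")
with gzip.open(blob, "wb") as f:
    f.write(os.urandom(50000))          # incompressible -> long stream
bad = os.path.join(tmp, "bad.gz")
with open(blob, "rb") as src, open(bad, "wb") as dst:
    dst.write(src.read()[:-100])        # truncated mid-stream

print(is_gzip_readable(good), is_gzip_readable(bad))  # True False
```

A pre-flight pass like this over a folder would at least identify the bad archives before a job touches them, at the cost of decompressing everything once.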
I identified a few of those and tested them with tar -zxvf. They indeed exit with an error but still extract a decent number of files before this happens.
Is there a way to stop hive/mapreduce jobs from simply crashing when a tar/gz file is corrupted?
I've tested some error skipping and failure tolerance parameters such as
mapred.skip.attempts.to.start.skipping,
mapred.skip.map.max.skip.records,
mapred.skip.mode.enabled,
mapred.map.max.attempts,
mapred.max.map.failures.percent,
mapreduce.map.failures.maxpercent.
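For reference, these knobs are typically set per job, e.g. in a Hive session. This is purely illustrative (the old mapred.* names from the question, with made-up values, not recommendations):

```sql
-- Illustrative only: skip mode and failure tolerance set per Hive session.
SET mapred.skip.mode.enabled=true;
SET mapred.skip.attempts.to.start.skipping=2;
SET mapred.skip.map.max.skip.records=1;
SET mapred.map.max.attempts=4;
SET mapred.max.map.failures.percent=5;
```

Note that record skipping targets bad *records*, while a truncated gzip stream fails at the decompression layer, which may explain why these settings help only sporadically here.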
This helped in a small number of cases to get a complete folder processed without crashing, but mostly it caused the job to hang and never finish at all.
Unzipping every single file outside Hadoop just to recompress them afterwards (to get clean gzip files) and then upload them to HDFS again would be a painful process (because of the extra steps and the large volume of data this would generate).
Has anyone found a cleaner / more elegant solution?
Any help is appreciated.
Answer
I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I ended up solving it by writing my own RecordReader which would catch IOExceptions, log the name of the file that had a problem, and then gracefully discard that file and move on to the next one.
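The actual RecordReader is Hadoop Java code (linked below); as a rough, purely illustrative Python analogue of the same catch/log/skip pattern, a reader can yield whatever lines each gzip file delivers, and on a truncated stream log the file name and carry on instead of aborting the whole run:

```python
# Illustrative Python analogue of the catch/log/skip pattern described
# above; the file names are temporary demo files, not real HDFS paths.
import gzip
import logging
import os
import tempfile
import zlib

def read_lines_skipping_corrupt(paths):
    """Yield text lines from each gzip file in turn; if a stream turns
    out to be truncated or corrupt, log the file name and carry on with
    the next file instead of raising."""
    for path in paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
                for line in f:
                    yield line
        except (EOFError, OSError, zlib.error) as exc:
            # Lines already yielded from this file are kept; only the
            # unreadable remainder is lost.
            logging.warning("skipping corrupt file %s: %s", path, exc)

# Demo: one healthy archive and one truncated one.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.gz")
with gzip.open(good, "wt") as f:
    f.write("a\nb\nc\n")

bad = os.path.join(tmp, "bad.gz")
with gzip.open(bad, "wb") as f:
    f.write(os.urandom(20000))             # incompressible -> long stream
with open(bad, "r+b") as f:
    f.truncate(os.path.getsize(bad) - 50)  # chop the tail off

lines = list(read_lines_skipping_corrupt([bad, good]))
# The healthy file's lines survive even though the corrupt file came first.
```

The key design point mirrors the RecordReader approach: partial data recovered before the stream breaks is kept, and a single bad file can no longer take down the whole job.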
I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/