Handle corrupted gzip files in Hadoop / Hive
Question
I have daily folders with a lot of tar.gz files on HDFS containing a large number of text files.
A number of those tar.gz files turned out to be corrupted and cause hive/mapreduce jobs to crash with an "unexpected end of stream" error when processing them.
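Outside Hadoop, the same failure mode is easy to reproduce and check for with plain Python: a gzip stream cut off before its end-of-stream marker fails partway through decompression. This is a hedged sketch with throwaway temp files, not the real HDFS data:

```python
# Sketch: simulate a truncated gzip file and detect it with a full
# decompression pass. All paths are temporary demo files.
import gzip
import os
import tempfile
import zlib

def is_gzip_readable(path):
    """Return True iff the whole gzip stream decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: stream truncated before the end-of-stream marker;
        # OSError/BadGzipFile and zlib.error: header or data corruption.
        return False

tmp = tempfile.mkdtemp()

good = os.path.join(tmp, "good.gz")
with gzip.open(good, "wt") as f:
    f.write("some text line\n" * 1000)

# Simulate corruption: compress incompressible data, then chop the tail.
blob = os.path.join(tmp, "blob.gz")
with gzip.open(blob, "wb") as f:
    f.write(os.urandom(50000))          # incompressible -> long stream
bad = os.path.join(tmp, "bad.gz")
with open(blob, "rb") as src, open(bad, "wb") as dst:
    dst.write(src.read()[:-100])        # truncated mid-stream

print(is_gzip_readable(good), is_gzip_readable(bad))  # True False
```

A pre-flight pass like this over a folder would at least identify the bad archives before a job touches them, at the cost of decompressing everything once.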
I identified a few of those and tested them with tar -zxvf. They indeed exit with an error but still extract a decent number of files before this happens.
Is there a way to stop hive/mapreduce jobs from simply crashing when a tar/gz file is corrupted?
I've tested some error skipping and failure tolerance parameters such as
mapred.skip.attempts.to.start.skipping,
mapred.skip.map.max.skip.records,
mapred.skip.mode.enabled,
mapred.map.max.attempts,
mapred.max.map.failures.percent,
mapreduce.map.failures.maxpercent.
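For reference, these knobs are typically set per job, e.g. in a Hive session. This is purely illustrative (the old mapred.* names from the question, with made-up values, not recommendations):

```sql
-- Illustrative only: skip mode and failure tolerance set per Hive session.
SET mapred.skip.mode.enabled=true;
SET mapred.skip.attempts.to.start.skipping=2;
SET mapred.skip.map.max.skip.records=1;
SET mapred.map.max.attempts=4;
SET mapred.max.map.failures.percent=5;
```

Note that record skipping targets bad *records*, while a truncated gzip stream fails at the decompression layer, which may explain why these settings help only sporadically here.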
This helped in a small number of cases to get a complete folder processed without crashing, but mostly it caused the job to hang and never finish at all.
Unzipping every single file outside Hadoop just to recompress them afterwards (to get clean gzip files) and then upload them to HDFS again would be a painful process (because of the extra steps and the large volume of data this would generate).
Has anyone found a cleaner / more elegant solution?
Any help is appreciated.
Answer
I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I ended up solving it by writing my own RecordReader which would catch IOExceptions, log the name of the file that had a problem, and then gracefully discard that file and move on to the next one.
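The actual RecordReader is Hadoop Java code (linked below); as a rough, purely illustrative Python analogue of the same catch/log/skip pattern, a reader can yield whatever lines each gzip file delivers, and on a truncated stream log the file name and carry on instead of aborting the whole run:

```python
# Illustrative Python analogue of the catch/log/skip pattern described
# above; the file names are temporary demo files, not real HDFS paths.
import gzip
import logging
import os
import tempfile
import zlib

def read_lines_skipping_corrupt(paths):
    """Yield text lines from each gzip file in turn; if a stream turns
    out to be truncated or corrupt, log the file name and carry on with
    the next file instead of raising."""
    for path in paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
                for line in f:
                    yield line
        except (EOFError, OSError, zlib.error) as exc:
            # Lines already yielded from this file are kept; only the
            # unreadable remainder is lost.
            logging.warning("skipping corrupt file %s: %s", path, exc)

# Demo: one healthy archive and one truncated one.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.gz")
with gzip.open(good, "wt") as f:
    f.write("a\nb\nc\n")

bad = os.path.join(tmp, "bad.gz")
with gzip.open(bad, "wb") as f:
    f.write(os.urandom(20000))             # incompressible -> long stream
with open(bad, "r+b") as f:
    f.truncate(os.path.getsize(bad) - 50)  # chop the tail off

lines = list(read_lines_skipping_corrupt([bad, good]))
# The healthy file's lines survive even though the corrupt file came first.
```

The key design point mirrors the RecordReader approach: partial data recovered before the stream breaks is kept, and a single bad file can no longer take down the whole job.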
I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/