Why can't hadoop split up a large text file and then compress the splits using gzip?

Question

I've recently been looking into hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster. Except it can't do this with gzip'd files because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input and split it like normal, then compress each split using gzip separately? When any split is accessed, it's just decompressed on the fly.
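
To make the scheme concrete, here is a minimal sketch in plain Java of what is being proposed (this is not HDFS or Hadoop code; the class name, the 64MB split size, and the ".partN.gz" naming are invented for illustration): cut a text file into fixed-size splits, gzip each split on its own, and decompress any single split independently of the others.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitThenGzip {

    // Hypothetical 64MB "block size"; HDFS itself never does this for gzip'd input.
    static final int SPLIT_SIZE = 64 * 1024 * 1024;

    // Cut a plain text file into fixed-size splits and gzip each one independently.
    static void compressSplits(Path input) throws IOException {
        byte[] buffer = new byte[SPLIT_SIZE];
        try (InputStream in = Files.newInputStream(input)) {
            int part = 0;
            int read;
            while ((read = in.readNBytes(buffer, 0, SPLIT_SIZE)) > 0) {
                Path split = Paths.get(input + ".part" + part++ + ".gz");
                try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(split))) {
                    out.write(buffer, 0, read); // each .gz is a complete, standalone stream
                }
            }
        }
    }

    // Any single split can be decompressed "on the fly" without touching the others.
    static byte[] readSplit(Path split) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(split))) {
            return in.readAllBytes();
        }
    }
}
```

One thing such a scheme would still have to deal with: a cut at a fixed byte offset usually lands in the middle of a line, so whatever reads the splits would need to handle records that straddle split boundaries, much as readers of uncompressed HDFS splits do today.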

In my scenario, each split is compressed completely independently. There are no dependencies between splits, so you don't need the entire original file to decompress any one of the splits. That is the approach this patch takes: https://issues.apache.org/jira/browse/HADOOP-7076, note that this is not what I'd want.

This seems pretty basic... what am I missing? Why couldn't this be done? Or if it could be done, why have the hadoop developers not looked down this route? It seems strange given how much discussion I've found regarding people wanting to split gzip'd files in HDFS.

Answer

The simple reason is the design principle of "separation of concerns".

If you did what you propose, HDFS would have to know what the actual bits and bytes of the file mean, and it would also have to be able to reason about them (i.e. extract, decompress, etc.). In general you don't want this kind of mixing of responsibilities in software.

So the 'only' part that has to understand what the bits mean is the application that must be able to read them, which is commonly written using the MapReduce part of Hadoop.
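
For example, in a plain MapReduce job it is the InputFormat chosen by the application that notices a gzip'd input, decompresses it record by record, and decides whether it may be split; HDFS only stores and serves raw blocks. A minimal, hypothetical driver is sketched below (the input/output paths and class names are made up; only the Hadoop classes are real):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipAwareJob {

    // A pass-through mapper: the point is where decompression happens, not the map logic.
    static class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip-aware job");
        job.setJarByClass(GzipAwareJob.class);
        job.setMapperClass(PassThroughMapper.class);

        // TextInputFormat (application side) looks at the ".gz" suffix, picks GzipCodec via
        // CompressionCodecFactory, and decompresses lines on the fly. Because gzip is not a
        // splittable codec, the InputFormat also refuses to split the file, so the whole
        // file is read by a single mapper. None of this logic lives in HDFS.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/big-logfile.gz"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```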

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):


Always remember that there are alternative approaches:

HTH
