Why can't hadoop split up a large text file and then compress the splits using gzip?

Question

I've recently been looking into hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster. Except it can't do this with gzip'd files because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input and split it like normal, then compress each split using gzip separately? When any split is accessed, it's just decompressed on the fly.
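
To make the proposal concrete, here is a minimal, self-contained sketch (plain java.util.zip, no Hadoop involved; the class name and the toy chunk size are made up for illustration) of what "compress each split separately, then decompress on the fly" means: every chunk becomes a complete gzip stream of its own, so any single chunk can be decompressed without touching the others.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class IndependentChunkGzip {

    // Compress each fixed-size chunk of the input as its own complete gzip stream.
    static List<byte[]> compressChunks(byte[] data, int chunkSize) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data, off, len);
            }
            chunks.add(bos.toByteArray());
        }
        return chunks;
    }

    // Any single chunk can be decompressed on its own, without the rest of the file.
    static byte[] decompressChunk(byte[] chunk) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(chunk))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "line one\nline two\nline three\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = compressChunks(original, 8 * 1024);   // toy "block size"
        // Read only the second chunk: no other chunk is needed.
        byte[] part = decompressChunk(chunks.get(1));
        System.out.println("chunks=" + chunks.size() + ", secondChunkBytes=" + part.length);
    }
}
```

Running it prints the number of chunks and shows that the second chunk decompresses by itself, with no reference to the original file or to the other chunks.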

In my scenario, each split is compressed completely independently. There are no dependencies between splits, so you don't need the entire original file to decompress any one of the splits. That is the approach this patch takes: https://issues.apache.org/jira/browse/HADOOP-7076; note that this is not what I'd want.
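
For reference, the place where splittability is actually decided in current Hadoop is the input format rather than HDFS. The sketch below mirrors the kind of check TextInputFormat performs (the subclass name is made up for illustration, and the body is a paraphrase rather than an exact copy): a codec that does not implement SplittableCompressionCodec, which includes the standard gzip codec, forces the whole file into a single split.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative subclass showing the splittability decision; the stock
// TextInputFormat contains essentially the same logic.
public class SplitCheckTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true;  // uncompressed text: split freely along block boundaries
        }
        // Only codecs that can start reading mid-stream (e.g. bzip2) are splittable;
        // the plain gzip codec is not, so a .gz file becomes a single split.
        return codec instanceof SplittableCompressionCodec;
    }
}
```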

This seems pretty basic... what am I missing? Why couldn't this be done? Or, if it could be done, why have the hadoop developers not gone down this route? It seems strange given how much discussion I've found from people wanting splittable gzip'd files in HDFS.

Answer

The simple reason is the design principle of "separation of concerns".

If you do what you propose, then HDFS would have to know what the actual bits and bytes of the file mean, and it would also have to be able to reason about them (i.e. extract, decompress, etc.). In general you don't want this kind of mixing of responsibilities in software.

So the only part that has to understand what the bits mean is the application that reads the file, and that application is commonly written using the MapReduce part of Hadoop.
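
As a small sketch of what that separation looks like from the reading application's side (the path is hypothetical; only standard Hadoop client calls are used): HDFS hands back an opaque byte stream, and it is the application that looks up and applies a codec to it.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadMaybeCompressed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input.gz");          // hypothetical path

        // HDFS itself only returns raw bytes...
        InputStream raw = fs.open(file);

        // ...it is the application that decides whether, and how, to decompress them.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        InputStream in = (codec == null) ? raw : codec.createInputStream(raw);

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());       // first line, decompressed on the fly
        }
    }
}
```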

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):

Always remember that there are alternative approaches:
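
As one illustration (a commonly suggested option, not necessarily one of the alternatives the author had in mind), you can store data in a compression format that is splittable by design, such as bzip2. A minimal sketch of configuring a MapReduce job to write bzip2-compressed output:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableOutputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "write-splittable-output");

        // BZip2Codec implements SplittableCompressionCodec, so files written this
        // way can later be split across mappers, unlike plain gzip output.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        // ... set mapper/reducer classes and input/output paths here,
        // then call job.waitForCompletion(true).
    }
}
```

The usual trade-off is that bzip2 compression and decompression are considerably more CPU-intensive than gzip.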

HTH
