Why can't hadoop split up a large text file and then compress the splits using gzip?

Question

I've recently been looking into hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster. Except it can't do this with gzip'd files because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input and split it like normal, then compress each split using gzip separately? When any split is accessed, it's just decompressed on the fly.
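
To make the proposal concrete, here is a minimal, self-contained sketch (plain java.util.zip, no Hadoop involved; the class name and the toy chunk size are made up for illustration) of what "compress each split separately, then decompress on the fly" means: every chunk becomes a complete gzip stream of its own, so any single chunk can be decompressed without touching the others.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class IndependentChunkGzip {

    // Compress each fixed-size chunk of the input as its own complete gzip stream.
    static List<byte[]> compressChunks(byte[] data, int chunkSize) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data, off, len);
            }
            chunks.add(bos.toByteArray());
        }
        return chunks;
    }

    // Any single chunk can be decompressed on its own, without the rest of the file.
    static byte[] decompressChunk(byte[] chunk) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(chunk))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "line one\nline two\nline three\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = compressChunks(original, 8 * 1024);   // toy "block size"
        // Read only the second chunk: no other chunk is needed.
        byte[] part = decompressChunk(chunks.get(1));
        System.out.println("chunks=" + chunks.size() + ", secondChunkBytes=" + part.length);
    }
}
```

Running it prints the number of chunks and shows that the second chunk decompresses by itself, with no reference to the original file or to the other chunks.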

In my scenario, each split is compressed completely independently. There are no dependencies between splits, so you don't need the entire original file to decompress any one of the splits. That is the approach this patch takes: https://issues.apache.org/jira/browse/HADOOP-7076; note that this is not what I'd want.
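
For reference, the place where splittability is actually decided in current Hadoop is the input format rather than HDFS. The sketch below mirrors the kind of check TextInputFormat performs (the subclass name is made up for illustration, and the body is a paraphrase rather than an exact copy): a codec that does not implement SplittableCompressionCodec, which includes the standard gzip codec, forces the whole file into a single split.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Illustrative subclass showing the splittability decision; the stock
// TextInputFormat contains essentially the same logic.
public class SplitCheckTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true;  // uncompressed text: split freely along block boundaries
        }
        // Only codecs that can start reading mid-stream (e.g. bzip2) are splittable;
        // the plain gzip codec is not, so a .gz file becomes a single split.
        return codec instanceof SplittableCompressionCodec;
    }
}
```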

This seems pretty basic... what am I missing? Why couldn't this be done? Or, if it could be done, why have the hadoop developers not gone down this route? It seems strange given how much discussion I've found from people wanting splittable gzip'd files in HDFS.

Answer

The simple reason is the design principle of "separation of concerns".

If you do what you propose, then HDFS would have to know what the actual bits and bytes of the file mean, and it would also have to be able to reason about them (i.e. extract, decompress, etc.). In general you don't want this kind of mixing of responsibilities in software.

So the only part that has to understand what the bits mean is the application that reads the file, and that application is commonly written using the MapReduce part of Hadoop.
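
As a small sketch of what that separation looks like from the reading application's side (the path is hypothetical; only standard Hadoop client calls are used): HDFS hands back an opaque byte stream, and it is the application that looks up and applies a codec to it.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadMaybeCompressed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input.gz");          // hypothetical path

        // HDFS itself only returns raw bytes...
        InputStream raw = fs.open(file);

        // ...it is the application that decides whether, and how, to decompress them.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        InputStream in = (codec == null) ? raw : codec.createInputStream(raw);

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());       // first line, decompressed on the fly
        }
    }
}
```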

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):

Always remember that there are alternative approaches:
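
As one illustration (a commonly suggested option, not necessarily one of the alternatives the author had in mind), you can store data in a compression format that is splittable by design, such as bzip2. A minimal sketch of configuring a MapReduce job to write bzip2-compressed output:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableOutputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "write-splittable-output");

        // BZip2Codec implements SplittableCompressionCodec, so files written this
        // way can later be split across mappers, unlike plain gzip output.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        // ... set mapper/reducer classes and input/output paths here,
        // then call job.waitForCompletion(true).
    }
}
```

The usual trade-off is that bzip2 compression and decompression are considerably more CPU-intensive than gzip.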

HTH
