Why can't hadoop split up a large text file and then compress the splits using gzip?

Question

I've recently been looking into hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster. Except it can't do this with gzip'd files because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input and split it like normal, then compress each split using gzip separately? When any split is accessed, it's just decompressed on the fly.
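
To make the scheme concrete, here is a minimal sketch in plain Java of what is being proposed (this is not HDFS or Hadoop code; the class name, the 64MB split size, and the ".partN.gz" naming are invented for illustration): cut a text file into fixed-size splits, gzip each split on its own, and decompress any single split independently of the others.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitThenGzip {

    // Hypothetical 64MB "block size"; HDFS itself never does this for gzip'd input.
    static final int SPLIT_SIZE = 64 * 1024 * 1024;

    // Cut a plain text file into fixed-size splits and gzip each one independently.
    static void compressSplits(Path input) throws IOException {
        byte[] buffer = new byte[SPLIT_SIZE];
        try (InputStream in = Files.newInputStream(input)) {
            int part = 0;
            int read;
            while ((read = in.readNBytes(buffer, 0, SPLIT_SIZE)) > 0) {
                Path split = Paths.get(input + ".part" + part++ + ".gz");
                try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(split))) {
                    out.write(buffer, 0, read); // each .gz is a complete, standalone stream
                }
            }
        }
    }

    // Any single split can be decompressed "on the fly" without touching the others.
    static byte[] readSplit(Path split) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(split))) {
            return in.readAllBytes();
        }
    }
}
```

One thing such a scheme would still have to deal with: a cut at a fixed byte offset usually lands in the middle of a line, so whatever reads the splits would need to handle records that straddle split boundaries, much as readers of uncompressed HDFS splits do today.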

In my scenario, each split is compressed completely independently. There are no dependencies between splits, so you don't need the entire original file to decompress any one of the splits. That is the approach this patch takes: https://issues.apache.org/jira/browse/HADOOP-7076, note that this is not what I'd want.

This seems pretty basic... what am I missing? Why couldn't this be done? Or if it could be done, why have the hadoop developers not looked down this route? It seems strange given how much discussion I've found regarding people wanting to split gzip'd files in HDFS.

Answer

The simple reason is the design principle of "separation of concerns".

If you did what you propose, HDFS would have to know what the actual bits and bytes of the file mean, and it would also have to be able to reason about them (i.e. extract, decompress, etc.). In general you don't want this kind of mixing of responsibilities in software.

So the 'only' part that has to understand what the bits mean is the application that must be able to read them, which is commonly written using the MapReduce part of Hadoop.
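
For example, in a plain MapReduce job it is the InputFormat chosen by the application that notices a gzip'd input, decompresses it record by record, and decides whether it may be split; HDFS only stores and serves raw blocks. A minimal, hypothetical driver is sketched below (the input/output paths and class names are made up; only the Hadoop classes are real):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipAwareJob {

    // A pass-through mapper: the point is where decompression happens, not the map logic.
    static class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip-aware job");
        job.setJarByClass(GzipAwareJob.class);
        job.setMapperClass(PassThroughMapper.class);

        // TextInputFormat (application side) looks at the ".gz" suffix, picks GzipCodec via
        // CompressionCodecFactory, and decompresses lines on the fly. Because gzip is not a
        // splittable codec, the InputFormat also refuses to split the file, so the whole
        // file is read by a single mapper. None of this logic lives in HDFS.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/big-logfile.gz"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```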

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):


Always remember that there are alternative approaches:

HTH
