Is Snappy splittable or not splittable?


Question


According to this Cloudera post, Snappy IS splittable.

For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.

But according to Hadoop: The Definitive Guide, Snappy is NOT splittable.

There is also some conflicting information on the web. Some say it's splittable, some say it's not.

Solution

Both are correct, but at different levels.

According to the Cloudera blog post http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

This means that if a whole text file is compressed with Snappy, the file is NOT splittable. But if each record (or block of records) inside the file is compressed with Snappy on its own, the file can be splittable, for example in Sequence Files with block compression.

To be clearer, this layout:

<START-FILE>
  <START-SNAPPY-BLOCK>
     FULL CONTENT
  <END-SNAPPY-BLOCK>
<END-FILE>

is not the same as this one:

<START-FILE>
  <START-SNAPPY-BLOCK1>
     RECORD1
  <END-SNAPPY-BLOCK1>
  <START-SNAPPY-BLOCK2>
     RECORD2
  <END-SNAPPY-BLOCK2>
  <START-SNAPPY-BLOCK3>
     RECORD3
  <END-SNAPPY-BLOCK3>
<END-FILE>

An individual Snappy block is NOT splittable, but a file made up of several Snappy blocks is splittable at the block boundaries.
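
The two layouts above can be tried out with a small Python sketch. Two loud assumptions: zlib stands in for Snappy (only because zlib ships with the standard library), and the length-prefixed framing is a toy format invented for this example, not the real SequenceFile or Avro layout. The helper names (`compress_whole`, `compress_per_block`, `read_block`, `decodes_from`) are hypothetical too. The point is only that a reader can start at a block boundary, while it cannot start in the middle of a single compressed stream.

```python
import struct
import zlib

# NOTE: zlib is only a stand-in for Snappy here (Snappy is not in the
# standard library). The framing below is a made-up toy format, not the
# real SequenceFile or Avro layout.


def compress_whole(records):
    """Compress all records as one stream: a single opaque block."""
    return zlib.compress(b"\n".join(records))


def compress_per_block(records):
    """Compress each record on its own and prefix it with a 4-byte length.

    Returns the file bytes plus the byte offset at which each block starts.
    """
    out = bytearray()
    offsets = []
    for record in records:
        block = zlib.compress(record)
        offsets.append(len(out))
        out += struct.pack(">I", len(block)) + block
    return bytes(out), offsets


def read_block(data, offset):
    """Decode the single block starting at offset, ignoring everything before it."""
    (length,) = struct.unpack_from(">I", data, offset)
    start = offset + 4
    return zlib.decompress(data[start:start + length])


def decodes_from(data, offset):
    """Report whether a one-stream file can be decoded starting at offset."""
    try:
        zlib.decompress(data[offset:])
        return True
    except zlib.error:
        return False


records = [b"RECORD1 " * 50, b"RECORD2 " * 50, b"RECORD3 " * 50]

whole = compress_whole(records)
blocked, offsets = compress_per_block(records)

# A worker handed the second half of the single-stream file: can it decode it?
print(decodes_from(whole, len(whole) // 2))

# A worker handed the offset of block 2 decodes it without touching blocks 1 and 3.
print(read_block(blocked, offsets[1]) == records[1])
```

In this sketch the block offsets are simply returned to the caller. Real container formats locate block boundaries in their own way (Sequence Files, for instance, write sync markers between blocks), and the LZO index mentioned in the quote plays a similar role for LZO files.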
