Is gzipped Parquet file splittable in HDFS for Spark?


Problem description

I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that a gzipped CSV is not splittable, but maybe the internal file structure of Parquet is such that it is a totally different case for Parquet vs. CSV?

Recommended answer

Parquet files with GZIP compression are in fact splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used.

This is mainly due to the design of Parquet files, which are divided into the following parts:

  1. Each Parquet file consists of several RowGroups; these should be the same size as your HDFS block size.
  2. Each RowGroup contains one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
  3. ColumnChunks are split into Pages, which are typically between 64 KiB and 16 MiB in size. Compression is applied per page, so a page is the lowest level of parallelisation a job can work on.
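To make the layout concrete, here is a minimal PySpark sketch (the data set and output path are illustrative, not part of the original answer) that writes a gzip-compressed Parquet file. The row-group size that later acts as the split boundary is governed by the Parquet writer property parquet.block.size in the Hadoop configuration, which defaults to roughly 128 MB:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-gzip-demo").getOrCreate()

# Illustrative data set; the output path is hypothetical.
df = spark.range(0, 500_000_000)

# "compression" selects the codec used inside each Parquet page.
# The row-group size itself comes from the Parquet writer property
# parquet.block.size (Hadoop configuration), ~128 MB by default, which
# the answer above suggests aligning with the HDFS block size.
df.write.parquet("/tmp/numbers_gzip.parquet", compression="gzip")
```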

You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
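Continuing the sketch above (still illustrative, using the same hypothetical path), reading the file back shows the practical effect: Spark can plan several tasks over a single gzip-compressed Parquet file, whereas a gzipped CSV is handled by one task per file:

```python
# Spark splits the Parquet scan roughly along row-group boundaries
# (subject to spark.sql.files.maxPartitionBytes), even with gzip pages.
parquet_df = spark.read.parquet("/tmp/numbers_gzip.parquet")
print(parquet_df.rdd.getNumPartitions())  # > 1 for a sufficiently large file

# A gzip-compressed CSV, by contrast, cannot be decoded from the middle
# of the stream, so Spark assigns a single partition per .gz file.
# csv_df = spark.read.csv("/tmp/big_file.csv.gz")
# print(csv_df.rdd.getNumPartitions())    # 1 per gzipped file
```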
