Is gzipped Parquet file splittable in HDFS for Spark?


Problem description

I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that a gzipped CSV is not splittable, but maybe the internal file structure of Parquet is such that it is a totally different case for Parquet vs. CSV?

Recommended answer

Parquet files with GZIP compression are in fact splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used.

This is mainly due to the design of Parquet files, which are divided into the following parts:

  1. Each Parquet file consists of several RowGroups; these should be the same size as your HDFS block size.
  2. Each RowGroup contains one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
  3. ColumnChunks are split into Pages, which are typically between 64 KiB and 16 MiB in size. Compression is applied per page, so a page is the lowest level of parallelisation a job can work on.
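To make the layout concrete, here is a minimal PySpark sketch (the data set and output path are illustrative, not part of the original answer) that writes a gzip-compressed Parquet file. The row-group size that later acts as the split boundary is governed by the Parquet writer property parquet.block.size in the Hadoop configuration, which defaults to roughly 128 MB:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-gzip-demo").getOrCreate()

# Illustrative data set; the output path is hypothetical.
df = spark.range(0, 500_000_000)

# "compression" selects the codec used inside each Parquet page.
# The row-group size itself comes from the Parquet writer property
# parquet.block.size (Hadoop configuration), ~128 MB by default, which
# the answer above suggests aligning with the HDFS block size.
df.write.parquet("/tmp/numbers_gzip.parquet", compression="gzip")
```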

You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
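Continuing the sketch above (still illustrative, using the same hypothetical path), reading the file back shows the practical effect: Spark can plan several tasks over a single gzip-compressed Parquet file, whereas a gzipped CSV is handled by one task per file:

```python
# Spark splits the Parquet scan roughly along row-group boundaries
# (subject to spark.sql.files.maxPartitionBytes), even with gzip pages.
parquet_df = spark.read.parquet("/tmp/numbers_gzip.parquet")
print(parquet_df.rdd.getNumPartitions())  # > 1 for a sufficiently large file

# A gzip-compressed CSV, by contrast, cannot be decoded from the middle
# of the stream, so Spark assigns a single partition per .gz file.
# csv_df = spark.read.csv("/tmp/big_file.csv.gz")
# print(csv_df.rdd.getNumPartitions())    # 1 per gzipped file
```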
