Is it better to have one large parquet file or lots of smaller parquet files?


Question


I understand HDFS will split files into something like 64MB chunks. We have streaming data coming in, and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files so that the smallest column is 64MB, would that save any computation time compared with having, say, 1GB files?

Solution

Aim for around 1GB per file (spark partition) (1).
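As a rough sketch of what the 1GB target means in practice, the number of output files can be estimated from the total dataset size. The function name `target_file_count` and the 50GB example are illustrative assumptions, not something taken from the referenced post.

```python
import math

GIB = 1024 ** 3  # bytes in one gibibyte


def target_file_count(total_bytes, target_file_bytes=GIB):
    """Estimate how many files of roughly target_file_bytes hold total_bytes.

    Hypothetical helper: always returns at least 1 so that tiny or empty
    datasets still map to a single output file.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Example: a 50 GiB dataset with the ~1 GiB-per-file target.
print(target_file_count(50 * GIB))
```

The result can then be used as the partition count when writing the data out.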

Ideally, you would use snappy compression (the default), because snappy-compressed parquet files are splittable (2).

Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered.

.option("compression", "gzip") is the option to override the default snappy compression.
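A minimal sketch of where that option goes in a write call, assuming `df` is an existing Spark DataFrame. The wrapper `write_parquet` and its restriction to the two codecs discussed here are assumptions made for illustration; Spark itself accepts other codec names as well.

```python
def write_parquet(df, path, codec="snappy"):
    """Write df as parquet using an explicit compression codec.

    Hypothetical wrapper around the DataFrameWriter chain
    df.write.option("compression", codec).parquet(path).
    Only the two codecs discussed in this answer are allowed here.
    """
    if codec not in ("snappy", "gzip"):
        raise ValueError("unsupported codec for this sketch: " + codec)
    df.write.option("compression", codec).parquet(path)


# Usage, assuming a DataFrame named df and a writable output path:
# write_parquet(df, "/data/events_gzip", codec="gzip")
```

Choosing gzip here trades extra CPU time during writes and reads for smaller files on disk.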

If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>) or, in the worst case, .repartition(<num_partitions>). Warning: repartition always triggers a full shuffle of the data, and coalesce, although it avoids a full shuffle when reducing the partition count, can still move data between executors, so use both with some caution.
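One way to choose between the two calls is to compare the desired partition count with the current one: coalesce can only reduce the count, while repartition can raise it at the cost of a full shuffle. The helper below is a sketch under that assumption; `resize_partitions` is a hypothetical name, and `df` is assumed to be a Spark DataFrame.

```python
def resize_partitions(df, num_partitions):
    """Return df with num_partitions partitions, shuffling only if needed.

    Hypothetical helper:
    - fewer partitions wanted -> coalesce (merges existing partitions)
    - more partitions wanted  -> repartition (full shuffle)
    - same count              -> df returned unchanged
    """
    current = df.rdd.getNumPartitions()
    if num_partitions < current:
        return df.coalesce(num_partitions)
    if num_partitions > current:
        return df.repartition(num_partitions)
    return df


# Usage, assuming a DataFrame named df:
# resized = resize_partitions(df, 50)
# resized.write.parquet("/data/events")
```

Each partition is written as its own file, so the partition count at write time determines how many parquet files land in the output directory.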

Also, parquet files, and for that matter files in general, should be larger than the HDFS block size (default 128MB).

1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/
