Is it better to have one large parquet file or lots of smaller parquet files?


Question

I understand HDFS will split files into something like 64 MB chunks. We have streaming data coming in, and we can store it in large files or medium-sized files. What is the optimum file size for columnar storage? If I can store files such that the smallest column is 64 MB, would that save any computation time over having, say, 1 GB files?

Recommended answer

Aim for around 1 GB per file (Spark partition) (1).

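Ideally, you would use snappy compression (the default), since snappy-compressed parquet files are splittable (2). Parquet applies compression to the data pages inside each row group rather than to the file as a whole, so a reader can still split the file at row-group boundaries.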

Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered.

.option("compression", "gzip") is the option to override the default snappy compression.
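As a minimal sketch of where that option goes, assuming `df` is a PySpark DataFrame: the helper name `write_parquet` and its `path` and `codec` parameters are made up for illustration, and only the `"compression"` option itself comes from the answer above.

```python
def write_parquet(df, path, codec="snappy"):
    """Write df as parquet with an explicit compression codec.

    Assumptions: df is a pyspark.sql.DataFrame; write_parquet, path and
    codec are hypothetical names used only for this sketch.
    """
    # "compression" is the DataFrameWriter option quoted in the answer;
    # passing "gzip" overrides the snappy default.
    df.write.option("compression", codec).parquet(path)
```

Calling `write_parquet(df, "/some/output/dir", codec="gzip")` (the path is a placeholder) would trade larger CPU cost for smaller files, which is the storage trade-off described above.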

If you need to resize/repartition your Dataset/DataFrame/RDD, call the .coalesce(<num_partitions>) function or, in the worst case, the .repartition(<num_partitions>) function. Warning: repartition in particular, but also coalesce, can cause a reshuffle of the data, so use both with some caution.
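The choice between the two calls can be sketched as follows, assuming `df` is a PySpark DataFrame. The helper name `resize_partitions` is hypothetical; the sketch prefers `coalesce` when the partition count is being reduced and falls back to `repartition` only when it has to grow, since `repartition` performs a full shuffle.

```python
def resize_partitions(df, num_partitions):
    """Return df with num_partitions partitions, preferring coalesce.

    Assumptions: df is a pyspark.sql.DataFrame; resize_partitions is a
    hypothetical helper, not part of Spark.
    """
    current = df.rdd.getNumPartitions()
    if num_partitions < current:
        # Shrinking: coalesce merges existing partitions.
        return df.coalesce(num_partitions)
    if num_partitions > current:
        # Growing: coalesce cannot increase the count, so repartition
        # (a full shuffle) is the fallback.
        return df.repartition(num_partitions)
    # Already at the requested count: nothing to do.
    return df
```

One caveat with this design: merging many partitions into very few with `coalesce` can leave the resulting partitions unevenly sized, in which case the more expensive `repartition` may still be the better choice.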

Also, parquet files, and for that matter all files, should generally be larger than the HDFS block size (default 128 MB).
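To turn the sizing advice into a partition count, here is a rough sketch. It assumes you already have an estimate of the dataset's on-disk size in bytes; the function name `target_file_count` and its parameters are illustrative, with defaults taken from the 1 GB target and 128 MB block size mentioned in this answer.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def target_file_count(total_bytes, target_file_bytes=GIB, block_bytes=128 * MIB):
    """Estimate how many output files give roughly target-sized files.

    Assumptions: total_bytes is an estimate of the written (compressed)
    size; all names here are hypothetical.
    """
    if total_bytes <= 0:
        return 1
    # Never aim for files smaller than one HDFS block.
    per_file = max(target_file_bytes, block_bytes)
    # Round up so no file exceeds the target; always write at least one.
    return max(1, math.ceil(total_bytes / per_file))
```

The result could be passed as `<num_partitions>` to `.coalesce` or `.repartition` before writing. Because the estimate depends on the compression ratio, it is worth checking the actual file sizes after a first write and adjusting.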

1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/
