Correct Parquet file size when storing in S3?


Question

I've been reading a few questions on this topic, and several forums, and in all of them they seem to mention that each of the resulting .parquet files coming out of Spark should be either 64MB or 1GB in size, but I still can't work out which scenarios each of those file sizes fits and the reasons behind them, apart from HDFS splitting them into 64MB blocks.

My current testing scenario is the following.

dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below.
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)

I'm currently handling a total of 2.5GB to 3GB of daily data, which will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is just for testing purposes: since I know the size of my testing set in advance, I try to get a number as close to 64MB or 1GB as I can. I haven't implemented code to buffer the needed data until I reach the exact size I need prior to saving.
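
For reference, here is a minimal sketch of how 'n' could be derived from an estimated input size and a target file size instead of being hardcoded. It reuses dataset, CONSTANTS and outputPath from the snippet above; the estimated size and the 64MB target are assumptions for illustration, not part of the original question.

import org.apache.spark.sql.SaveMode

// Assumed figures: ~3GB of daily input, ~64MB target files (gives n close to 48).
// Compressed, columnar Parquet output is usually smaller than the raw input,
// so treat this as a rough starting point rather than an exact size control.
val estimatedBytes  = 3L * 1024 * 1024 * 1024
val targetFileBytes = 64L * 1024 * 1024
val n = math.max(1, math.ceil(estimatedBytes.toDouble / targetFileBytes).toInt)

dataset
  .coalesce(n)
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)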

So my question here is...

Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?

And also, what would be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?

Any other optimization tip would be really appreciated!

Answer

You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
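
As a hedged illustration of that setting, here is how fs.s3a.block.size could be passed to Spark through the spark.hadoop. prefix; the 128MB value and the application name are placeholders chosen for this sketch, not recommendations from the answer.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-s3a-block-size")  // placeholder name
  // Block size the s3a connector reports to readers, which drives split size.
  // The value is in bytes; 128MB here is only an example.
  .config("spark.hadoop.fs.s3a.block.size", (128L * 1024 * 1024).toString)
  .getOrCreate()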

Smaller split size

  • More workers can work on the file at once, which speeds things up if you have idle workers.
  • More startup overhead for scheduling work, starting processing, committing tasks.
  • More files are created from the output, unless you repartition.

Small files vs large files

Small files:

  • you get that small split whether or not you want it.
  • even if you use unsplittable compression.
  • takes longer to list files. Listing directory trees on s3 is very slow.
  • impossible to ask for larger block sizes than the file length.
  • easier to save if your s3 client doesn't do incremental writes in blocks (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true; see the config sketch after this list).
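
As a hedged sketch of that last bullet, this is one way to enable incremental block-based uploads for s3a on Hadoop 2.8+. The fs.s3a.fast.upload property comes from the answer; the disk buffering option and the application name are assumptions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-incremental-upload")                         // placeholder name
  .config("spark.hadoop.fs.s3a.fast.upload", "true")         // incremental block uploads (Hadoop 2.8+)
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")  // assumed buffering choice, not from the answer
  .getOrCreate()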

Personally, and this is opinion and somewhat benchmark-driven, but not with your queries:

Writing

  • Save to larger files.
  • Use snappy.
  • Prefer shallower and wider directory trees over deep and narrow ones (a short write sketch follows this list).
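
A minimal write sketch combining those points, reusing dataset and outputPath from the question; the repartition count and the single "date" partition column are assumptions, and snappy is spelled out even though it is already Spark's default Parquet codec.

import org.apache.spark.sql.SaveMode

dataset
  .repartition(4)                      // assumed count: fewer, larger output files
  .write
  .mode(SaveMode.Append)
  .option("compression", "snappy")     // the codec recommended above
  .partitionBy("date")                 // assumed single column: a shallow, wide tree
  .parquet(outputPath)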

Reading

  • Play with different block sizes; treat 32-64 MB as a minimum (a read-side config sketch follows this list).
  • On Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2.
  • If your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: spark.hadoop.fs.s3a.experimental.fadvise random).
  • Save to larger files via .repartition().
  • Keep an eye on how much data you are collecting, as it is very easy to run up large bills by storing lots of old data.
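
A hedged read-side sketch that puts these tunables together. The property names are either quoted in the answer or standard Hadoop/Spark configuration keys ("switch to v2" is read here as the FileOutputCommitter algorithm version); the concrete values, paths and repartition count are assumptions for illustration.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("parquet-s3a-read-tuning")                                           // placeholder name
  .config("spark.hadoop.fs.s3a.block.size", (64L * 1024 * 1024).toString)       // 32-64MB minimum, per the answer
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")                 // random IO for columnar reads (Hadoop 2.8+)
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")  // pre-Hadoop-3.1 "v2" committer
  .getOrCreate()

// Placeholder paths; compacting small files into larger ones on rewrite.
val inputPath  = "s3a://example-bucket/daily-in"
val outputPath = "s3a://example-bucket/daily-out"

spark.read.parquet(inputPath)
  .repartition(48)          // assumed count
  .write
  .mode(SaveMode.Append)
  .parquet(outputPath)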

See also Improving Spark Performance with S3/ADLS/WASB.
