Correct Parquet file size when storing in S3?


Question

I've been reading a few questions on this topic, and several forums, and in all of them they seem to mention that each of the resulting .parquet files coming out of Spark should be either 64MB or 1GB in size, but I still can't work out which scenarios each of those file sizes fits and the reasons behind them, apart from HDFS splitting them into 64MB blocks.

My current testing scenario is the following.

dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below.
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)

I'm currently handling a total of 2.5GB to 3GB of daily data, which will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is just for testing purposes: since I know the size of my testing set in advance, I try to get a number as close to 64MB or 1GB as I can. I haven't implemented code to buffer the needed data until I reach the exact size I need prior to saving.
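
For reference, here is a minimal sketch of how 'n' could be derived from an estimated input size and a target file size instead of being hardcoded. It reuses dataset, CONSTANTS and outputPath from the snippet above; the estimated size and the 64MB target are assumptions for illustration, not part of the original question.

import org.apache.spark.sql.SaveMode

// Assumed figures: ~3GB of daily input, ~64MB target files (gives n close to 48).
// Compressed, columnar Parquet output is usually smaller than the raw input,
// so treat this as a rough starting point rather than an exact size control.
val estimatedBytes  = 3L * 1024 * 1024 * 1024
val targetFileBytes = 64L * 1024 * 1024
val n = math.max(1, math.ceil(estimatedBytes.toDouble / targetFileBytes).toInt)

dataset
  .coalesce(n)
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)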

So my question here is...

Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?

And also, what would be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?

Any other optimization tip would be really appreciated!

Answer

You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
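
As a hedged illustration of that setting, here is how fs.s3a.block.size could be passed to Spark through the spark.hadoop. prefix; the 128MB value and the application name are placeholders chosen for this sketch, not recommendations from the answer.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-s3a-block-size")  // placeholder name
  // Block size the s3a connector reports to readers, which drives split size.
  // The value is in bytes; 128MB here is only an example.
  .config("spark.hadoop.fs.s3a.block.size", (128L * 1024 * 1024).toString)
  .getOrCreate()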

Smaller split size

  • More workers can work on the file at once, which speeds things up if you have idle workers.
  • More startup overhead for scheduling work, starting processing, committing tasks.
  • More files are created from the output, unless you repartition.

Small files vs large files

Small files:

  • you get that small split whether or not you want it.
  • even if you use unsplittable compression.
  • takes longer to list files. Listing directory trees on s3 is very slow.
  • impossible to ask for larger block sizes than the file length.
  • easier to save if your s3 client doesn't do incremental writes in blocks (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true; see the config sketch after this list).
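
As a hedged sketch of that last bullet, this is one way to enable incremental block-based uploads for s3a on Hadoop 2.8+. The fs.s3a.fast.upload property comes from the answer; the disk buffering option and the application name are assumptions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-incremental-upload")                         // placeholder name
  .config("spark.hadoop.fs.s3a.fast.upload", "true")         // incremental block uploads (Hadoop 2.8+)
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")  // assumed buffering choice, not from the answer
  .getOrCreate()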

Personally, and this is opinion and somewhat benchmark-driven, but not with your queries:

Writing

  • Save to larger files.
  • Use snappy.
  • Prefer shallower and wider directory trees over deep and narrow ones (a short write sketch follows this list).
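
A minimal write sketch combining those points, reusing dataset and outputPath from the question; the repartition count and the single "date" partition column are assumptions, and snappy is spelled out even though it is already Spark's default Parquet codec.

import org.apache.spark.sql.SaveMode

dataset
  .repartition(4)                      // assumed count: fewer, larger output files
  .write
  .mode(SaveMode.Append)
  .option("compression", "snappy")     // the codec recommended above
  .partitionBy("date")                 // assumed single column: a shallow, wide tree
  .parquet(outputPath)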

Reading

  • Play with different block sizes; treat 32-64 MB as a minimum (a read-side config sketch follows this list).
  • On Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2.
  • If your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: spark.hadoop.fs.s3a.experimental.fadvise random).
  • Save to larger files via .repartition().
  • Keep an eye on how much data you are collecting, as it is very easy to run up large bills by storing lots of old data.
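
A hedged read-side sketch that puts these tunables together. The property names are either quoted in the answer or standard Hadoop/Spark configuration keys ("switch to v2" is read here as the FileOutputCommitter algorithm version); the concrete values, paths and repartition count are assumptions for illustration.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("parquet-s3a-read-tuning")                                           // placeholder name
  .config("spark.hadoop.fs.s3a.block.size", (64L * 1024 * 1024).toString)       // 32-64MB minimum, per the answer
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")                 // random IO for columnar reads (Hadoop 2.8+)
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")  // pre-Hadoop-3.1 "v2" committer
  .getOrCreate()

// Placeholder paths; compacting small files into larger ones on rewrite.
val inputPath  = "s3a://example-bucket/daily-in"
val outputPath = "s3a://example-bucket/daily-out"

spark.read.parquet(inputPath)
  .repartition(48)          // assumed count
  .write
  .mode(SaveMode.Append)
  .parquet(outputPath)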

See also Improving Spark Performance with S3/ADLS/WASB.
