S3 and Spark: File size and File format best practices

Problem Description

I need to read data (originating from a Redshift table with 5 columns; the total size of the table is on the order of 500 GB - 1 TB) from S3 into Spark via PySpark for a daily batch job.
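
For context, the read side of the job currently looks roughly like the sketch below (the bucket, prefix, and CSV format are placeholders; choosing them well is exactly what I'm asking about):

    # Rough shape of the daily batch read; bucket, prefix, and the CSV format
    # here are placeholders, not the actual job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-batch").getOrCreate()

    df = spark.read.csv("s3a://my-bucket/redshift-unload/2019-01-01/",
                        header=True, inferSchema=True)
    df.printSchema()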

Are there any best practices around:

  • A preferred file format for storing the data in S3? (does the format even matter?)
  • An optimal file size?

Any resources/links that can point me in the right direction would also work.

Thanks!

Recommended Answer

This blog post has some great info on the subject:

https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/

Look at the section titled: Use the Best Data Store for Your Use Case

From personal experience, I prefer using parquet in most scenarios, because I’m usually writing the data out once, and then reading it many times (for analytics).
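
As a rough sketch of that write-once / read-many pattern in PySpark (the bucket, paths, and column name below are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-once-read-many").getOrCreate()

    # Write once: persist the day's batch output as Parquet on S3
    df = spark.read.csv("s3a://my-bucket/staging/2019-01-01/", header=True)
    df.write.mode("overwrite").parquet("s3a://my-bucket/curated/events/")

    # Read many times: later analytics jobs read the same Parquet data back
    events = spark.read.parquet("s3a://my-bucket/curated/events/")
    events.groupBy("user_id").count().show()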

In terms of the number of files, I like to have between 200 and 1,000. This allows clusters of all sizes to read and write in parallel, and allows my reading of the data to be efficient, because with parquet I can zoom in on just the file I'm interested in. If you have too many files, there is a ton of overhead in Spark remembering all the file names and locations, and if you have too few files, it can't parallelize your reads and writes effectively.
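
If you need to land in that range, one common way (not the only one) is to repartition before the write so each partition becomes one output file; the target of 500 and the paths are only illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("control-file-count").getOrCreate()

    target_files = 500  # somewhere in the 200-1,000 range discussed above

    df = spark.read.parquet("s3a://my-bucket/staging/2019-01-01/")

    # repartition(n) shuffles into n partitions, which become ~n Parquet files
    (df.repartition(target_files)
       .write
       .mode("overwrite")
       .parquet("s3a://my-bucket/curated/events/"))

    # coalesce(n) is cheaper when you only need to reduce the partition count,
    # because it avoids a full shuffle:
    # df.coalesce(target_files).write.mode("overwrite").parquet("s3a://...")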

When using parquet, I have found file size to be less important than the number of files.

Here’s a good section from that blog post that describes why I like to use parquet:

Apache Parquet gives the fastest read performance with Spark. Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects and encodes the same or similar data, using a technique that conserves resources. Parquet also stores column metadata and statistics, which can be pushed down to filter columns (discussed below). Spark 2.x has a vectorized Parquet reader that does decompression and decoding in column batches, providing ~ 10x faster read performance.
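
To make the column pruning and filter pushdown mentioned in that passage concrete, here is a small illustrative read (the path and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("parquet-pruning-demo").getOrCreate()

    events = spark.read.parquet("s3a://my-bucket/curated/events/")

    # Selecting a few columns lets Spark read only those Parquet column chunks,
    # and the filter can be pushed down to Parquet row-group statistics.
    daily = (events
             .select("event_date", "user_id", "amount")
             .filter(col("event_date") == "2019-01-01"))

    daily.explain()  # the physical plan shows ReadSchema and PushedFilters
    daily.show(10)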
