spark.sql.files.maxPartitionBytes not limiting max size of written partitions


Problem description

I'm trying to copy Parquet data from another S3 bucket to my S3 bucket. I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes would be set to 128 MB by default, but when I look at the partition files in S3 after my copy, I see individual partition files of around 226 MB instead. I was looking at this post, which suggested setting this Spark config key to limit the maximum size of my partitions: Limiting maximum size of dataframe partition. But it doesn't seem to work?

This is the definition of that config key:

The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
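
To make the scope of that setting concrete, here is a minimal sketch, assuming a plain Spark job with a SparkSession (not the Glue script below); the S3 path is a placeholder. The value only governs how input files are split into read partitions.

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a plain Spark job; the S3 path is a placeholder.
val spark = SparkSession.builder().getOrCreate()

// 134217728 bytes = 128 MB is the documented default.
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

// Each read partition covers at most that many bytes of input
// (several small files can also be packed together up to that size).
val df = spark.read.parquet("s3://source-bucket/some/path/")
println(df.rdd.getNumPartitions)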

I'm also a bit confused about how this relates to the size of the written Parquet files.

For reference, I am running a Glue script on Glue version 1.0 with Spark 2.4, and the script is this:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode

// Point Spark SQL at the Glue Data Catalog as its Hive metastore;
// catalogId is assumed to be provided elsewhere (e.g. a job argument).
val conf: SparkConf = new SparkConf()
conf.set("spark.sql.catalogImplementation", "hive")
    .set("spark.hadoop.hive.metastore.glue.catalogid", catalogId)
val spark: SparkContext = new SparkContext(conf)

val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession

// Read the source table and overwrite the target S3 prefix as Parquet.
val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")

Recommended answer

The setting spark.sql.files.maxPartitionBytes does affect the maximum size of the partitions when reading data into the Spark cluster. If the files produced by your output are too large, I suggest decreasing the value of this setting; it should create more files, because the input data will be distributed across more partitions. This will not hold, however, if there is any shuffle in your query, because the data will then always be repartitioned into the number of partitions given by the spark.sql.shuffle.partitions setting.
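
For illustration, a sketch of that suggestion against the sparkSession from the script above; the 64 MB and 400 values are arbitrary examples, not recommendations from the answer.

// Lower the read-side split size so the input is spread over more partitions.
sparkSession.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024) // 64 MB, example value

// Only relevant when the query shuffles (joins, aggregations, ORDER BY, ...):
// after a shuffle the partition count comes from this setting instead.
sparkSession.conf.set("spark.sql.shuffle.partitions", 400L) // example value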

Also, the final size of your files will depend on the file format and compression you use. So if you write the data out as, for example, Parquet, the files will be much smaller than if you write CSV or JSON.
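
As a rough sketch (reusing sqlDF from the question's script; the sub-paths and the gzip codec are just examples), the same DataFrame written as Parquet versus JSON:

// Parquet is columnar and compressed (snappy by default in Spark 2.4),
// so the same rows usually take far less space than JSON or CSV.
sqlDF.write.mode(SaveMode.Overwrite)
  .option("compression", "gzip") // or leave the default "snappy"
  .parquet("s3://my-s3-location/parquet/")

sqlDF.write.mode(SaveMode.Overwrite)
  .json("s3://my-s3-location/json/")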
