spark.sql.files.maxPartitionBytes not limiting max size of written partitions


Problem description

I'm trying to copy parquet data from another s3 bucket to my s3 bucket. I want to limit the size of each partition to a max of 128 MB. I thought spark.sql.files.maxPartitionBytes would be set to 128 MB by default, but when I look at the partition files in s3 after my copy, I see individual partition files around 226 MB instead. I was looking at this post, which suggested setting this Spark config key to limit the max size of my partitions: Limiting maximum size of dataframe partition. But it doesn't seem to work.

Here is the definition of that config key:

The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
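
For illustration, here is a minimal sketch of where this setting takes effect (the session setup and s3 path are hypothetical, not from the question): it is applied when files are read, splitting the input into partitions of at most the configured size.

import org.apache.spark.sql.SparkSession

val readSpark = SparkSession.builder()
  .appName("maxPartitionBytes-demo")
  .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024) // 128 MB per read split
  .getOrCreate()

// Each read task receives at most ~128 MB of input from the source files.
val source = readSpark.read.parquet("s3://some-source-bucket/data/")
println(source.rdd.getNumPartitions) // number of read partitions, not a cap on written file size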

I'm also a bit confused about how this relates to the size of the written parquet files.

For reference, I am running a Glue script on Glue version 1.0 (Spark 2.4), and the script is this:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode

val conf: SparkConf = new SparkConf()
conf.set("spark.sql.catalogImplementation", "hive")
    .set("spark.hadoop.hive.metastore.glue.catalogid", catalogId)
val spark: SparkContext = new SparkContext(conf)

val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession

// Read from the Glue catalog table and overwrite the parquet output in S3.
val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")

Answer

The setting spark.sql.files.maxPartitionBytes does affect the maximum size of the partitions when reading data on the Spark cluster. If the final files after the write are too large, I suggest decreasing the value of this setting; that should create more files, because the input data will be distributed among more partitions. This will not hold, however, if your query contains a shuffle, because the data will then always be repartitioned into the number of partitions given by the spark.sql.shuffle.partitions setting.
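
As a hedged sketch of that suggestion (the 64 MB value and the reuse of the question's sparkSession, table, and output path are illustrative assumptions): lowering the setting before the read produces more, smaller input partitions, and with no shuffle in the query the write emits roughly one file per partition; once a shuffle is involved, spark.sql.shuffle.partitions decides the partition count instead.

// Illustrative value: read the source in ~64 MB splits instead of the 128 MB default.
sparkSession.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

// No shuffle in this query, so the write produces roughly one output file per read partition.
val smallerFilesDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
smallerFilesDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")

// If the query shuffled (e.g. a GROUP BY or JOIN), the number of output files would instead
// follow spark.sql.shuffle.partitions.
sparkSession.conf.set("spark.sql.shuffle.partitions", 400) // illustrative value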

Also, the final size of your files will depend on the file format and compression that you use. So if you output the data to, for example, parquet, the files will be much smaller than when outputting to csv or json.
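
For example (a sketch; the codec choice and output paths are illustrative, reusing the question's sqlDF), writing the same DataFrame as compressed parquet versus plain json gives very different on-disk sizes:

// Columnar, compressed output: typically the smallest files.
sqlDF.write.mode(SaveMode.Overwrite)
  .option("compression", "snappy") // parquet's default codec; "gzip" compresses further
  .parquet("s3://my-s3-location/parquet/")

// Row-oriented text output of the same data is considerably larger.
sqlDF.write.mode(SaveMode.Overwrite).json("s3://my-s3-location/json/")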

