How to control partition size in Spark SQL
Question
I have a requirement to load data from a Hive table using Spark SQL HiveContext and load it into HDFS. By default, the DataFrame from the SQL output has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes a number-of-partitions parameter.
Repartitioning the RDD causes shuffling and results in more processing time (a sketch of this workaround follows the log output below).
val result = sqlContext.sql("select * from bt_st_ent")
With the following log output:
Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)
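For reference, the shuffle-based workaround mentioned above looks roughly like this; a minimal sketch where the partition count of 100 is an arbitrary illustration, not a value from the question:

// Forces the requested parallelism, but redistributes every row over the network.
val repartitioned = result.repartition(100)
println(repartitioned.rdd.partitions.length) // 100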
I would like to know if there is any way to increase the number of partitions of the SQL output.
Answer
Spark < 2.0:
You can use Hadoop configuration options:
mapred.min.split.size
mapred.max.split.size
as well as the HDFS block size, to control partition size for filesystem-based formats*.
val minSplit: Int = ???  // desired minimum split size in bytes
val maxSplit: Int = ???  // desired maximum split size in bytes

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
Spark 2.0+:
You can use the spark.sql.files.maxPartitionBytes configuration:
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
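As a usage sketch, assuming a Spark 2.x SparkSession named spark (the default for this setting is 128 MB; a smaller value such as 32 MB yields more partitions for file-based sources):

// Cap each file-derived partition at ~32 MB instead of the 128 MB default.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)

val df = spark.sql("select * from bt_st_ent")
println(df.rdd.getNumPartitions)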
In both cases these values may not be honored by a specific data source API, so you should always check the documentation / implementation details of the format you use.
* Other input formats can use different settings; consult the documentation of the input format in question.
Furthermore, Datasets created from RDDs will inherit the partition layout of their parents.
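A small sketch of that inheritance, assuming a Spark 2.x SparkSession named spark:

import spark.implicits._

// An RDD created with an explicit partition count...
val rdd = spark.sparkContext.parallelize(1 to 1000, 8)

// ...passes that layout on to the Dataset built from it.
val ds = rdd.toDS()
println(ds.rdd.getNumPartitions) // 8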
Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between buckets and Dataset partitions.
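To illustrate the bucket-to-partition mapping, a hedged sketch; the table name, column name and bucket count are hypothetical, and df stands for any existing DataFrame:

// Persist df as a table bucketed into 16 buckets on the "key" column.
df.write
  .bucketBy(16, "key")
  .sortBy("key")
  .saveAsTable("bucketed_events")

// Per the 1:1 relationship above, scanning it yields 16 partitions.
println(spark.table("bucketed_events").rdd.getNumPartitions)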