How to control partition size in Spark SQL
Problem description
I have a requirement to load data from a Hive table using Spark SQL HiveContext and load it into HDFS. By default, the DataFrame from the SQL output has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes the number of partitions as a parameter.
Repartitioning of the RDD causes shuffling and results in more processing time.
val result = sqlContext.sql("select * from bt_st_ent")
This has the following log output:
Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)
I would like to know if there is any way to increase the number of partitions of the SQL output.
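For context, the shuffle-based workaround mentioned above would look roughly like this; the target count of 100 is hypothetical:
// repartition yields more partitions, but at the cost of a full shuffle of result
val repartitioned = result.repartition(100)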
Recommended answer
Spark < 2.0:
You can use the Hadoop configuration options:
- mapred.min.split.size
- mapred.max.split.size
as well as the HDFS block size to control partition size for filesystem-based formats*.
val minSplit: Int = ???
val maxSplit: Int = ???
sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
Spark 2.0+:
You can use the spark.sql.files.maxPartitionBytes configuration:
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
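A minimal sketch, assuming a SparkSession named spark and that the table is read through Spark's file-based data sources; the 128 MB target is hypothetical:
// Hypothetical target of 128 MB per partition (value is in bytes)
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
val df = spark.sql("select * from bt_st_ent")
println(df.rdd.getNumPartitions)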
In both cases these values may not be used by a particular data source API, so you should always check the documentation / implementation details of the format you use.
* Other input formats can use different settings. See for example:
- Partitioning in spark while reading from RDBMS via JDBC
- Difference between mapreduce split and spark partition
Furthermore, Datasets created from RDDs will inherit the partition layout of their parents.
Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between buckets and Dataset partitions.
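As a quick illustration of the inheritance from RDDs, a sketch with an arbitrary partition count of 8:
import spark.implicits._
// The parent RDD's partitioning is preserved by the Dataset built from it
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
val ds = rdd.toDS()
println(rdd.getNumPartitions)     // 8
println(ds.rdd.getNumPartitions)  // 8, inherited from the parent RDD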