How to control partition size in Spark SQL


Problem Description

I need to load data from a Hive table using the Spark SQL HiveContext and write it to HDFS. By default, the DataFrame produced by the SQL query has 2 partitions. To get more parallelism I need more partitions out of the SQL, but HiveContext has no overloaded method that takes a number-of-partitions parameter.

Repartitioning the RDD causes a shuffle and results in more processing time.
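
For context, the shuffle-based workaround I would like to avoid looks roughly like this (the target count of 100 is only an illustrative value):

// Explicitly repartitioning after the query works, but forces a full shuffle.
val repartitioned = sqlContext
  .sql("select * from bt_st_ent")
  .repartition(100)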


val result = sqlContext.sql("select * from bt_st_ent")

produces the following log output:

Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)

I would like to know whether there is any way to increase the number of partitions of the SQL output.

Solution

Spark < 2.0:

You can use the Hadoop configuration options:

  • mapred.min.split.size
  • mapred.max.split.size

as well as the HDFS block size, to control partition size for filesystem-based formats*.

// Desired split sizes in bytes.
val minSplit: Int = ???
val maxSplit: Int = ???

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
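
As a rough sketch, assuming the table is backed by plain files on HDFS and you want splits of at most 64 MB (the property values are byte counts; the 64 MB figure is only an illustrative choice):

// Illustrative values: cap each input split at 64 MB, so a ~1 GB table
// is read as roughly 16 partitions instead of 2.
sc.hadoopConfiguration.setInt("mapred.min.split.size", 1 * 1024 * 1024)
sc.hadoopConfiguration.setInt("mapred.max.split.size", 64 * 1024 * 1024)

val result = sqlContext.sql("select * from bt_st_ent")
println(result.rdd.partitions.size) // check the resulting partition count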

Spark 2.0+:

You can use the spark.sql.files.maxPartitionBytes configuration:

spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
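
For example, a minimal sketch that caps scan partitions at 32 MB; the value is in bytes, the 32 MB figure is only an illustrative choice, and it only applies to tables read through Spark's file-based sources:

// Illustrative value: at most 32 MB of input per partition for file-based sources.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

val result = spark.sql("select * from bt_st_ent")
println(result.rdd.getNumPartitions) // more, smaller partitions than the default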

In both cases these values may not be honored by a specific data source API, so you should always check the documentation / implementation details of the format you use.


* Other input formats can use different settings. See for example

Furthermore, Datasets created from RDDs will inherit the partition layout of their parents.
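
For instance, a minimal sketch (assuming a SparkSession named spark; the parallelism of 50 is only an illustrative value):

import spark.implicits._

// The 50 partitions of the parent RDD carry over to the Dataset built from it.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 50)
val ds = rdd.toDS()
println(ds.rdd.getNumPartitions) // 50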

Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between buckets and Dataset partitions.
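
As a sketch of what such a bucketed table might look like (the table name bt_st_ent_bucketed, the bucket column customer_id, and the bucket count of 8 are all illustrative assumptions):

// Write the data bucketed into 8 buckets; the layout is recorded in the metastore.
spark.table("bt_st_ent")
  .write
  .bucketBy(8, "customer_id")
  .saveAsTable("bt_st_ent_bucketed")

// Reading it back then yields one Dataset partition per bucket.
println(spark.table("bt_st_ent_bucketed").rdd.getNumPartitions) // 8, per the 1:1 mapping described above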

