How does Spark SQL decide the number of partitions it will use when loading data from a Hive table?


Question

This question is related to another question, but I think that question did not get a correct answer. Note that the question asks how many partitions will be created when a dataframe is created by executing a SQL query against a Hive table using the SparkSession.sql method.

IIUC, the question above is distinct from asking how many partitions will be created when a dataframe is created by code such as spark.read.json("examples/src/main/resources/people.json"), which loads the data directly from the filesystem (possibly HDFS). I think the answer to this latter question is given by spark.sql.files.maxPartitionBytes:

spark.sql.files.maxPartitionBytes (default: 134217728, i.e. 128 MB): the maximum number of bytes to pack into a single partition when reading files.
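As a rough illustration of that setting (a minimal sketch, assuming a local SparkSession and Spark's bundled sample file examples/src/main/resources/people.json; the 4 MB value is arbitrary), one can lower the limit and inspect the resulting partition count:

import org.apache.spark.sql.SparkSession

// Sketch only: lower the per-partition byte limit so even a small file is split.
val spark = SparkSession.builder()
  .appName("file-partition-count")
  .master("local[*]")
  .config("spark.sql.files.maxPartitionBytes", "4194304") // 4 MB instead of the 128 MB default
  .getOrCreate()

val df = spark.read.json("examples/src/main/resources/people.json")
// Approximately total input size / maxPartitionBytes for file-based reads.
println(df.rdd.getNumPartitions)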

Experimentally, I tried creating a dataframe from a Hive table, and the number of partitions I get is not explained by total data in hive table / spark.sql.files.maxPartitionBytes.

Also, adding to the OP: it would be good to know how the number of partitions can be controlled, i.e. when one wants to force Spark to use a different number than it would use by default.


Answer

TL;DR: The default number of partitions when reading data from Hive is governed by the HDFS blockSize. The number of partitions can be increased by setting mapreduce.job.maps to an appropriate value, and decreased by setting mapreduce.input.fileinputformat.split.minsize to an appropriate value.
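A minimal sketch of how those knobs can be applied from Spark (my_db.my_table is a placeholder table, and the values 400 and 256 MB are arbitrary; the actual partition counts depend on the table's file sizes and the HDFS block size):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-partition-count")
  .enableHiveSupport()
  .getOrCreate()

// Default behaviour: splits roughly follow the HDFS block size.
println(spark.sql("SELECT * FROM my_db.my_table").rdd.getNumPartitions)

// More partitions: raise the hinted number of map tasks.
spark.sparkContext.hadoopConfiguration.setInt("mapreduce.job.maps", 400)
println(spark.sql("SELECT * FROM my_db.my_table").rdd.getNumPartitions)

// Fewer partitions: raise the minimum split size (here 256 MB).
spark.sparkContext.hadoopConfiguration
  .setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024)
println(spark.sql("SELECT * FROM my_db.my_table").rdd.getNumPartitions)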

Spark SQL creates an instance of HadoopRDD when loading data from a Hive table.

An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).

HadoopRDD in turn splits input files according to the computeSplitSize method defined in org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the new API) and org.apache.hadoop.mapred.FileInputFormat (the old API).

New API:

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

Old API:

protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}

computeSplitSize splits files according to the HDFS blockSize, but if the blockSize is less than minSize or greater than maxSize, it is clamped to those extremes. The HDFS blockSize can be obtained from:

hdfs getconf -confKey dfs.blocksize
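As a worked sketch of the clamping (the 128 MB block size and the other values below are only illustrative), the new-API formula behaves like this:

// Mirrors the new-API formula: max(minSize, min(maxSize, blockSize)).
def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
  math.max(minSize, math.min(maxSize, blockSize))

val mb = 1024L * 1024
computeSplitSize(128 * mb, 1, Long.MaxValue)        // 128 MB: defaults, one split per block
computeSplitSize(128 * mb, 256 * mb, Long.MaxValue) // 256 MB: larger minSize, so fewer partitions
computeSplitSize(128 * mb, 1, 64 * mb)              // 64 MB: smaller maxSize, so more partitions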

According to Table 8.5 of Hadoop: The Definitive Guide, minSize is obtained from mapreduce.input.fileinputformat.split.minsize and maxSize is obtained from mapreduce.input.fileinputformat.split.maxsize.

However, the book also mentions the following about mapreduce.input.fileinputformat.split.maxsize:

This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapreduce.job.maps (or the setNumMapTasks() method on JobConf).

This post also calculates the maxSize as the total input size divided by the number of map tasks.
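For example (the figures are purely illustrative), with 100 GB of total input and mapreduce.job.maps set to 1600, that indirectly computed size works out to about 64 MB per split; since this is below a typical 128 MB block size, it roughly doubles the number of partitions compared with the default.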
