Number of partitions of a Spark dataframe created by reading the data from a Hive table
Problem Description
I have a question about the number of partitions of a Spark dataframe.
Suppose I have a Hive table (employee) with the columns (name, age, id, location).
CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);
Suppose the employee table has 10 different locations, so the data will be partitioned into 10 partitions (directories) in HDFS.
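(For reference, a table partitioned by location is typically stored as one HDFS directory per distinct location value under the warehouse path; the path and location values below are placeholders based on the default Hive configuration, not from the original post.)
/user/hive/warehouse/employee/location=NY/part-00000
/user/hive/warehouse/employee/location=SF/part-00000
... one directory per distinct location, 10 in total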
If I create a Spark dataframe (df) by reading all the data of the Hive table (employee),
how many partitions will Spark create for the dataframe (df)?
df.rdd.partitions.size = ??
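For concreteness, here is a minimal sketch of how such a dataframe might be created and inspected; Hive support in the SparkSession and the table living in the current/default database are assumptions, not part of the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-partition-count")
  .enableHiveSupport()          // needed to read Hive tables
  .getOrCreate()

// Read the entire Hive table into a DataFrame (assumes the table is in the current database).
val df = spark.table("employee")

// The value the question asks about: number of partitions of the underlying RDD.
println(df.rdd.partitions.size)
// PySpark equivalent: df.rdd.getNumPartitions()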
Partitions are created depending on the HDFS block size.
Imagine you have read the 10 Hive partitions as a single RDD. If the block size is 128 MB, then
number of partitions = (total size of the 10 partitions in MB) / 128 MB,
since that is how many blocks the data occupies on HDFS.
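A rough way to sanity-check that rule of thumb on a running cluster is sketched below; the warehouse path and the 128 MB block size are assumptions, so substitute the actual table location and block size of your setup:

import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed location of the table's data under the default Hive warehouse directory.
val tablePath = new Path("/user/hive/warehouse/employee")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Total size in bytes of all files across the partition directories.
val totalBytes = fs.getContentSummary(tablePath).getLength

val blockSize = 128L * 1024 * 1024                      // assumed 128 MB HDFS block size
val estimated = math.ceil(totalBytes.toDouble / blockSize).toLong

println(s"estimated partitions: $estimated")
println(s"actual partitions:    ${df.rdd.partitions.size}")

Note that on recent Spark versions the split size for file-based reads is also governed by spark.sql.files.maxPartitionBytes (128 MB by default), so the actual count can deviate from a pure block-size estimate.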
Please refer to the following link:
http://www.bigsynapse.com/spark-input-output