Number of partitions of a Spark DataFrame created by reading data from a Hive table

Views: 126  Posted: 2018/6/12 13:58:44  hive apache-spark-sql

Problem description

I have a question about the number of partitions in a Spark DataFrame.

Suppose I have a Hive table (employee) with the columns (name, age, id, location):

CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);

The employee table has 10 distinct locations, so its data is split into 10 partitions in HDFS.

If I create a Spark DataFrame (df) by reading all of the data in the Hive table (employee), how many partitions will Spark create for the DataFrame (df)?

df.rdd.partitions.size = ??

Solution
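The number of Spark partitions is driven by the HDFS block size rather than directly by the number of Hive partitions.

Suppose you read all 10 Hive partitions as a single RDD and the block size is 128 MB. Then, approximately:

number of partitions = (total size of the 10 Hive partitions in MB) / 128 MB

where the total size is the size of the data as it is stored on HDFS. Treat this as an estimate: the actual count also depends on how many files the table consists of, on the input format, and on the Spark settings in use.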
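To make the arithmetic concrete, here is a small sketch in plain Python (no Spark required). The file sizes are hypothetical, and `estimate_partitions` is an illustrative helper, not a Spark API. It assumes one input split per HDFS block and that no split spans two files, which is a simplification of what Spark and Hadoop actually do.

```python
import math

def estimate_partitions(file_sizes_mb, block_size_mb=128):
    """Rough estimate of input partitions.

    Assumption (not taken from the original answer): each file is split
    into block-sized pieces, and every file yields at least one split.
    """
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

# Hypothetical layout: 10 location partitions, each holding one 300 MB file.
sizes = [300] * 10

# Formula from the answer: total size divided by block size, rounded up.
naive = math.ceil(sum(sizes) / 128)

# Per-file variant: splits are counted file by file, so the result can be higher.
per_file = estimate_partitions(sizes)

print(naive, per_file)  # prints "24 30"
```

The gap between the two numbers shows why the formula is only a starting point. To get the real value, check `df.rdd.partitions.size` on the DataFrame itself.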

Please refer to the following link:

http://www.bigsynapse.com/spark-input-output
