Need to Know Partitioning Details in Dataframe Spark


Question

I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The DataFrame is partitioned on an integer column.

My question is: once the data is loaded, how can I check how many records were created per partition? Basically, I want to check whether data skew is happening. How can I check the record count per partition?

Answer

You can, for instance, map over the partitions and determine their sizes:

// Example RDD: 1000 elements spread across 3 partitions
val rdd = sc.parallelize(0 until 1000, 3)

// Each partition emits a single value: its own record count
val partitionSizes = rdd.mapPartitions(iter => Iterator(iter.length)).collect()

// partitionSizes would be Array(333, 333, 334) in this example

This works for both the RDD and the Dataset/DataFrame API.
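For a DataFrame loaded over JDBC, as in the question, a minimal sketch of the same check could look like the following. The connection details (url, dbtable, bounds, numPartitions) are placeholders, not values from the question; spark_partition_id is Spark's built-in function that returns the partition id of each row.

import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical JDBC read; url, table, and partitioning options are placeholders
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/MYDB")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("partitionColumn", "ID")   // the integer column used for partitioning
  .option("lowerBound", "1")
  .option("upperBound", "40000000")
  .option("numPartitions", "40")
  .load()

// Option 1: count rows per Spark partition with the built-in partition id
df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()

// Option 2: same idea as the RDD snippet above, keeping the partition index
val counts = df.rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()

Either output makes skew visible directly: if one partition holds far more rows than the others, the bounds or the partitioning column need to be revisited.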
