Need to Know Partitioning Details in Dataframe Spark
Question
I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The DataFrame is partitioned on an integer column.
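A minimal sketch of such a partitioned JDBC read, assuming hypothetical connection details (the URL, credentials, query, column name id, and bounds below are placeholders, not values from the question):

// Read from DB2 over JDBC, splitting the load across partitions
// on an integer column. All option values here are illustrative.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/mydb")
  .option("dbtable", "(SELECT * FROM my_table) AS q")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")   // the integer column used for partitioning
  .option("lowerBound", 1L)
  .option("upperBound", 40000000L)
  .option("numPartitions", 100)
  .load()

Spark generates one query per partition with WHERE clauses over the range between lowerBound and upperBound, so an unevenly distributed partition column is exactly what leads to the skew asked about below.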
My question is: once the data is loaded, how can I check how many records were created per partition? Basically, what I want to check is whether data skew is happening. How can I check the record count per partition?
Accepted Answer
You can, for instance, map over the partitions and determine their sizes:
// Example RDD with 1000 elements spread over 3 partitions
val rdd = sc.parallelize(0 until 1000, 3)
// Emit one count per partition and collect the counts to the driver
val partitionSizes = rdd.mapPartitions(iter => Iterator(iter.size)).collect()
// partitionSizes is Array(333, 333, 334) in this example
This works for both the RDD and the Dataset/DataFrame API.
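For a DataFrame specifically, a convenient alternative is to group by spark_partition_id(), which tags each row with the ID of the partition it lives in. This is a sketch assuming df is the JDBC-loaded DataFrame from the question:

import org.apache.spark.sql.functions.spark_partition_id

// Count records per partition; a heavily uneven count column indicates skew.
df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()

This avoids dropping down to the RDD API and keeps the computation in the DataFrame engine, at the cost of a shuffle for the groupBy.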