Need to Know Partitioning Details in Dataframe Spark


Question

I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The DataFrame is partitioned on an integer column.

My question is: once the data is loaded, how can I check how many records were created per partition? Basically, I want to check whether data skew is happening. How can I check the record count per partition?

Answer

You can, for instance, map over the partitions and determine their sizes:

// Example RDD: 1000 elements spread across 3 partitions
val rdd = sc.parallelize(0 until 1000, 3)

// Each partition emits a single value: its own record count
val partitionSizes = rdd.mapPartitions(iter => Iterator(iter.length)).collect()

// partitionSizes would be Array(333, 333, 334) in this example

This works for both the RDD and the Dataset/DataFrame API.
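For a DataFrame loaded over JDBC, as in the question, a minimal sketch of the same check could look like the following. The connection details (url, dbtable, bounds, numPartitions) are placeholders, not values from the question; spark_partition_id is Spark's built-in function that returns the partition id of each row.

import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical JDBC read; url, table, and partitioning options are placeholders
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/MYDB")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("partitionColumn", "ID")   // the integer column used for partitioning
  .option("lowerBound", "1")
  .option("upperBound", "40000000")
  .option("numPartitions", "40")
  .load()

// Option 1: count rows per Spark partition with the built-in partition id
df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()

// Option 2: same idea as the RDD snippet above, keeping the partition index
val counts = df.rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()

Either output makes skew visible directly: if one partition holds far more rows than the others, the bounds or the partitioning column need to be revisited.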
