Need to Know Partitioning Details in Dataframe Spark
Question
I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The DataFrame is partitioned on an integer column.
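A minimal sketch of such a partitioned JDBC read, assuming hypothetical connection details (the URL, credentials, query, column name id, and bounds below are placeholders, not values from the question):

// Read from DB2 over JDBC, splitting the load across partitions
// on an integer column. All option values here are illustrative.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://host:50000/mydb")
  .option("dbtable", "(SELECT * FROM my_table) AS q")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")   // the integer column used for partitioning
  .option("lowerBound", 1L)
  .option("upperBound", 40000000L)
  .option("numPartitions", 100)
  .load()

Spark generates one query per partition with WHERE clauses over the range between lowerBound and upperBound, so an unevenly distributed partition column is exactly what leads to the skew asked about below.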
My question is: once the data is loaded, how can I check how many records were created per partition? Basically, what I want to check is whether data skew is happening. How can I check the record count per partition?
Accepted Answer
You can, for instance, map over the partitions and determine their sizes:
// Example RDD with 1000 elements spread over 3 partitions
val rdd = sc.parallelize(0 until 1000, 3)
// Emit one count per partition and collect the counts to the driver
val partitionSizes = rdd.mapPartitions(iter => Iterator(iter.size)).collect()
// partitionSizes is Array(333, 333, 334) in this example
This works for both the RDD and the Dataset/DataFrame API.
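For a DataFrame specifically, a convenient alternative is to group by spark_partition_id(), which tags each row with the ID of the partition it lives in. This is a sketch assuming df is the JDBC-loaded DataFrame from the question:

import org.apache.spark.sql.functions.spark_partition_id

// Count records per partition; a heavily uneven count column indicates skew.
df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()

This avoids dropping down to the RDD API and keeps the computation in the DataFrame engine, at the cost of a shuffle for the groupBy.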