How to manually run pyspark's partitioning function for debugging


Question

I'm having performance issues, and after checking Spark's web UI I found there is a severe data skew problem:

I have tried "distribute by" and "repartition" with multiple combinations of columns with no luck, so I'm trying to debug how Spark is partitioning the dataset (in order to fix it). Is there any way to manually run the partitioning function to create a column? Basically I'm trying to do something like:

df = df.withColumn("assigned_partition", partitioning_function())
df_grouped = df.groupby("assigned_partition").count()
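
For hash-based repartitioning by column, one way to approximate that assignment by hand is to recompute pmod(hash(partitioning columns), numPartitions), which, as far as I understand, is what Spark's hash partitioning does under the hood. A minimal sketch, where key_col and the partition count of 200 are placeholders for your own columns and shuffle partition count:

from pyspark.sql import functions as F

# Sketch: approximate the partition id that repartition(200, "key_col") would assign.
# "key_col" and 200 are placeholders; swap in your own partitioning columns and
# shuffle partition count. pmod(hash(...), n) mirrors Spark's hash partitioning.
num_partitions = 200
df_debug = df.withColumn(
    "assigned_partition",
    F.expr(f"pmod(hash(key_col), {num_partitions})")
)
df_debug.groupBy("assigned_partition").count().orderBy(F.col("count").desc()).show()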

That way I can identify patterns or reasons for the skew.

Note: this is right after querying Hive tables, so I know the skew is not due to any Spark logic or calculations.

Answer

I found it is possible to do this with spark_partition_id(). I then created a histogram to analyze the count() distribution and found it looks normal and has no outliers, so the skew does not come from how the dataset is partitioned.

from pyspark.sql.functions import spark_partition_id, col
import matplotlib.pyplot as plt

# Tag each row with the id of the partition it currently sits in
test_df = df.select(spark_partition_id().alias("partitionId"))

# Count rows per partition and plot the distribution of counts as a histogram
test_df.groupBy("partitionId").count().orderBy(col("count").desc()).select("count").toPandas().plot.hist()
plt.show()
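
If the histogram had shown outliers, a natural follow-up would be to break the per-partition counts down by a candidate column to see which values pile up in the heavy partitions. A rough sketch, where suspect_col is a placeholder for whatever column you suspect is driving the skew:

from pyspark.sql.functions import spark_partition_id, col

# Count rows per (partition, value) pair; "suspect_col" is a placeholder for the
# column you suspect is skewed. The heaviest combinations show up first.
(df.withColumn("partitionId", spark_partition_id())
   .groupBy("partitionId", "suspect_col")
   .count()
   .orderBy(col("count").desc())
   .show(20, truncate=False))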
