How to repartition a dataframe in Spark scala on a skewed column?


Problem description


I have a dataframe which has 500 partitions and is shuffled. I want to repartition it based on one column, say 'city', but the city column is extremely skewed as it has only three possible values. So when I repartition based on the column city, even if I specify 500 partitions, only three are getting data. Because of this I am running into performance issues. I searched on the internet but could not find any suitable solution. Is there a way to repartition the dataframe uniformly across partitions based on the city column? What I need is: city1 goes to, say, the first 5 partitions, city2 goes to the next 490 partitions, and city3 goes to the remaining 5 partitions.
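For context, here is a minimal sketch of the behaviour being described; the toy DataFrame, the column name `city`, and the values city1/city2/city3 are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder().appName("skew-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-in for the real data: only three possible city values, heavily skewed.
val df = (Seq.fill(980)("city2") ++ Seq.fill(10)("city1") ++ Seq.fill(10)("city3")).toDF("city")

// Hash-partitioning on a column with three distinct values can fill at most
// three partitions, no matter how many partitions are requested.
val byCity = df.repartition(500, $"city")
byCity.groupBy(spark_partition_id().as("pid")).count().show(false)
```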

Recommended answer


When we've encountered data with known skew, we've used a partitioner that applies controlled randomization for the skewed values. I outline how this can be done in this answer.
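The linked answer is not reproduced on this page, but as a rough sketch of the "controlled randomization" idea in Scala: the `SkewAwareCityPartitioner` class, the 5/490/5 partition layout, and the toy data below are illustrative assumptions, not the linked answer's actual code.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hypothetical partitioner (not the linked answer's code): each city gets a
// contiguous block of partitions, and every row is sent to a random partition
// inside its city's block, spreading the heavy key across many partitions.
class SkewAwareCityPartitioner(override val numPartitions: Int) extends Partitioner {
  // Assumed layout from the question: city1 -> 5, city2 -> 490, city3 -> 5 partitions.
  private val ranges: Map[String, (Int, Int)] = Map(
    "city1" -> (0, 5),
    "city2" -> (5, 495),
    "city3" -> (495, 500)
  )
  override def getPartition(key: Any): Int = {
    val (start, end) = ranges.getOrElse(key.toString, (0, numPartitions))
    start + Random.nextInt(end - start)
  }
}

val spark = SparkSession.builder().appName("skew-aware-repartition").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data with the same shape as above: one dominant city, two small ones.
val df = (Seq.fill(980)("city2") ++ Seq.fill(10)("city1") ++ Seq.fill(10)("city3"))
  .zipWithIndex.toDF("city", "id")

// Custom partitioners are only available on the RDD API, so key the rows
// by city, apply the partitioner, and drop the key again.
val partitioned = df.rdd
  .map(row => (row.getAs[String]("city"), row))
  .partitionBy(new SkewAwareCityPartitioner(500))
  .values

// Count rows per non-empty partition to see the skewed city spread out.
partitioned
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .filter(_._2 > 0)
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }
```

One trade-off of randomizing `getPartition`: rows sharing a key are deliberately spread over several partitions, so any later key-based join or aggregation has to tolerate that (or re-aggregate afterwards). At the pure DataFrame level a similar effect can be approximated by adding a random "salt" column and repartitioning on both columns, e.g. `df.repartition(500, $"city", $"salt")`.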
