How to repartition a dataframe in Spark Scala on a skewed column?
Question
I have a dataframe which has 500 partitions and is shuffled. I want to repartition it based on one column, say 'city'. But the city column is extremely skewed, as it has only three possible values. So when I repartition based on the city column, even if I specify 500 partitions, only three of them receive data. Because of this I am running into performance issues. I searched on the internet but could not find any suitable solution. Is there a way to repartition the dataframe uniformly across partitions based on the city column? What I need is: city1 goes to, say, the first 5 partitions, city2 goes to the next 490 partitions, and city3 goes to the remaining 5 partitions.
Answer
When we've encountered data with known skew, we've used a partitioner that applies controlled randomization to the skewed values. I outline how this can be done in this answer.
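The "controlled randomization" idea can be sketched with a salting approach: add a random salt column and repartition on `(city, salt)` so each city's rows are spread over many partitions instead of hashing to a single one. This is a minimal sketch, not the linked answer's exact code; the DataFrame contents, column names, and salt range are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SkewedRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-repartition")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical skewed data: almost all rows belong to city2.
    val df = (Seq.fill(5)("city1") ++ Seq.fill(490)("city2") ++ Seq.fill(5)("city3"))
      .toDF("city")

    // Add a random integer salt in [0, 100), then repartition on
    // (city, salt): rows of the same city now hash to many different
    // partitions rather than all colliding into one.
    val salted = df.withColumn("salt", (rand() * 100).cast("int"))
    val repartitioned = salted.repartition(500, col("city"), col("salt"))

    println(repartitioned.rdd.getNumPartitions) // 500

    spark.stop()
  }
}
```

Note that hash-partitioning on `(city, salt)` spreads each city's rows roughly uniformly across partitions; it does not give the exact 5/490/5 split described in the question. Reserving specific partition ranges per city would require a custom RDD `Partitioner` at the RDD level, since the DataFrame API does not expose custom partitioners directly.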