Why does the groupByKey operation always have 200 tasks?


Problem description

Whenever I run groupByKey on an RDD, the operation is split into 200 tasks, even when the original table is quite large, e.g. 2k partitions and tens of millions of rows.

Moreover, the operation seems to get stuck on the last two tasks, which take extremely long to compute.

Why is it 200? How can I increase it, and will that help?

Recommended answer

This setting comes from spark.sql.shuffle.partitions, which is the number of partitions used when grouping; it defaults to 200 but can be increased. Whether that helps depends on the cluster and the data.
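As a rough illustration, here is a minimal sketch of raising that partition count. The SparkSession setup, the sample data, and the value 2000 are illustrative assumptions, not part of the original answer; note that for a plain RDD, groupByKey also accepts an explicit partition count directly.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; the app name, master, and partition count are illustrative.
val spark = SparkSession.builder()
  .appName("groupByKey-partitions")
  .master("local[*]")
  // Raise the shuffle partition count used for DataFrame/SQL grouping (default 200).
  .config("spark.sql.shuffle.partitions", "2000")
  .getOrCreate()

// For a pair RDD, groupByKey can also be given the partition count explicitly,
// which overrides the default parallelism for this one shuffle.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey(2000)
```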

The last two tasks taking very long will be due to skewed data: those keys contain many more values. Can you use reduceByKey / combineByKey rather than groupByKey, or parallelize the problem differently?
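To show the difference the answer is pointing at, here is a small self-contained sketch (the SparkContext and sample data are assumed for illustration) that sums values per key both ways:

```scala
import org.apache.spark.SparkContext

// Minimal sketch; `sc` and the sample data are illustrative only.
val sc = SparkContext.getOrCreate()
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey shuffles every value for a key to a single task,
// so one heavily skewed key makes one task do most of the work:
val summedViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey (like combineByKey) pre-aggregates on the map side,
// so far less data is shuffled for hot keys:
val summedViaReduce = pairs.reduceByKey(_ + _)
```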

