Spark 'limit' does not run in parallel?

Problem Description

I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, and indeed at this stage only one task is running in the cluster.

This of course affects performance dramatically (removing the limit removes the single-task bottleneck, but lengthens the join, since it then works on a much larger dataset).
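For reference, a minimal sketch of the setup in Scala, with hypothetical input paths and column names; calling explain() on the joined DataFrame shows the single-partition exchange that precedes the limit:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("limit-demo").getOrCreate()

    // Hypothetical inputs; any two joinable DataFrames reproduce the plan.
    val ordersDf = spark.read.parquet("/data/orders")
    val customersDf = spark.read.parquet("/data/customers")

    // A global limit forces Spark to shuffle the limited side into a single
    // partition (Exchange SinglePartition followed by GlobalLimit in the
    // physical plan), so one task processes all the rows at that stage.
    val limited = ordersDf.limit(1000000)
    val joined = limited.join(customersDf, Seq("customer_id"))

    joined.explain()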

Is limit truly not parallelizable? And if so, is there a workaround?

I am using Spark on a Databricks cluster.

Regarding the possible duplicate: the answer there does not explain why everything is shuffled into a single partition. Also, I asked for advice on working around this issue.

Recommended Answer

Following the advice given by user8371915 in the comments, I used sample instead of limit, and it uncorked the bottleneck.
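Continuing the sketch from the question (same hypothetical DataFrames), the change amounts to swapping the limit call for a sample; sample filters each partition independently, so no single-partition exchange is introduced:

    // Before: limit() funnels the whole side through one partition.
    // val limited = ordersDf.limit(1000000)

    // After: sample() keeps the work distributed; each partition retains
    // roughly the given fraction of its rows, with no shuffle required.
    val sampled = ordersDf.sample(withReplacement = false, fraction = 0.01)
    val joined = sampled.join(customersDf, Seq("customer_id"))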

A small but important detail: I still had to put a predictable size constraint on the result set after sampling, but sample takes a fraction as input, so the size of the result set can vary greatly depending on the size of the input.

Fortunately for me, running the same query with count() was very fast, so I first counted the size of the entire result set and used that to compute the fraction I later passed to sample.
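Sketched out under the same assumptions (the target row count here is made up for illustration):

    // 1. Count the full result set first; in this case it ran quickly.
    val total = ordersDf.count()

    // 2. Compute the fraction expected to yield roughly the desired size.
    val desiredRows = 1000000L // hypothetical target
    val fraction =
      if (total == 0) 0.0 else math.min(1.0, desiredRows.toDouble / total)

    // 3. Sample with the computed fraction: approximately desiredRows rows,
    //    produced in parallel across partitions instead of in one task.
    val sampled = ordersDf.sample(withReplacement = false, fraction = fraction)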
