Spark 'limit' does not run in parallel?

Question

I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, and indeed at this stage there is only one task running in the cluster.
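For concreteness, here is a minimal PySpark sketch of the pattern described (the DataFrame names, paths, and join key are hypothetical, not taken from the original query):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_a = spark.read.parquet("/path/to/a")  # hypothetical inputs
df_b = spark.read.parquet("/path/to/b")

# Capping one side before the join: the global limit first forces
# all surviving rows into a single partition.
joined = df_a.join(df_b.limit(1000), "key")

# The physical plan for the limited side typically shows a
# GlobalLimit fed by an Exchange SinglePartition, which is the
# single-task stage observed in the cluster.
joined.explain()
```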

This of course affects performance dramatically (removing the limit removes the single-task bottleneck, but lengthens the join because it then works on a much larger dataset).

Is limit truly not parallelizable? And if so, is there a workaround?

I am using Spark on a Databricks cluster.

Regarding the possible duplicate: the answer there does not explain why everything is shuffled into a single partition. Also, I asked for advice on how to work around this issue.

Answer

Following the advice given by user8371915 in the comments, I used sample instead of limit, and it uncorked the bottleneck.
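As a rough sketch of that substitution (reusing the hypothetical DataFrames from above): unlike limit, sample filters each partition independently, so it needs no single-partition shuffle.

```python
# Replace the global limit with an approximate fraction of rows.
# sample() is applied per partition, so the stage stays parallel
# (no Exchange SinglePartition before the join).
sampled = df_b.sample(withReplacement=False, fraction=0.001, seed=42)
joined = df_a.join(sampled, "key")
```

The trade-off is that sample yields an approximate row count rather than an exact one, which is exactly the detail addressed next.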

A small but important detail: I still had to put a predictable size constraint on the result set after sampling. But sample takes a fraction, so the size of the result set can vary greatly depending on the size of the input.

Fortunately for me, running the same query with count() was very fast. So I first counted the size of the entire result set and used that to compute the fraction I later passed to sample.
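A sketch of that two-step approach, with a hypothetical target size and padding factor; the final limit acts as the hard cap from the earlier paragraph and is cheap at this point because the sampled set is already small:

```python
target_rows = 1000  # hypothetical hard cap on the limited side

# Counting the un-limited side was fast for this query,
# so use the count to derive the sampling fraction.
total = df_b.count()
fraction = min(1.0, 1.2 * target_rows / max(total, 1))  # pad slightly to avoid undershooting

capped = (
    df_b.sample(withReplacement=False, fraction=fraction, seed=42)
        .limit(target_rows)  # exact cap; the single-partition shuffle now sees only a small set
)
joined = df_a.join(capped, "key")
```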
