Spark 'limit' does not run in parallel?

Question

I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, and indeed at this stage there is only one task running in the cluster.
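For concreteness, here is a minimal PySpark sketch of the pattern described (the DataFrame names, paths, and join key are hypothetical, not taken from the original query):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_a = spark.read.parquet("/path/to/a")  # hypothetical inputs
df_b = spark.read.parquet("/path/to/b")

# Capping one side before the join: the global limit first forces
# all surviving rows into a single partition.
joined = df_a.join(df_b.limit(1000), "key")

# The physical plan for the limited side typically shows a
# GlobalLimit fed by an Exchange SinglePartition, which is the
# single-task stage observed in the cluster.
joined.explain()
```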

This of course affects performance dramatically (removing the limit removes the single-task bottleneck, but lengthens the join because it then works on a much larger dataset).

Is limit truly not parallelizable? And if so, is there a workaround?

I am using Spark on a Databricks cluster.

Regarding the possible duplicate: the answer there does not explain why everything is shuffled into a single partition. Also, I asked for advice on how to work around this issue.

Answer

Following the advice given by user8371915 in the comments, I used sample instead of limit, and it uncorked the bottleneck.
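As a rough sketch of that substitution (reusing the hypothetical DataFrames from above): unlike limit, sample filters each partition independently, so it needs no single-partition shuffle.

```python
# Replace the global limit with an approximate fraction of rows.
# sample() is applied per partition, so the stage stays parallel
# (no Exchange SinglePartition before the join).
sampled = df_b.sample(withReplacement=False, fraction=0.001, seed=42)
joined = df_a.join(sampled, "key")
```

The trade-off is that sample yields an approximate row count rather than an exact one, which is exactly the detail addressed next.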

A small but important detail: I still had to put a predictable size constraint on the result set after sampling. But sample takes a fraction, so the size of the result set can vary greatly depending on the size of the input.

Fortunately for me, running the same query with count() was very fast. So I first counted the size of the entire result set and used that to compute the fraction I later passed to sample.
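A sketch of that two-step approach, with a hypothetical target size and padding factor; the final limit acts as the hard cap from the earlier paragraph and is cheap at this point because the sampled set is already small:

```python
target_rows = 1000  # hypothetical hard cap on the limited side

# Counting the un-limited side was fast for this query,
# so use the count to derive the sampling fraction.
total = df_b.count()
fraction = min(1.0, 1.2 * target_rows / max(total, 1))  # pad slightly to avoid undershooting

capped = (
    df_b.sample(withReplacement=False, fraction=fraction, seed=42)
        .limit(target_rows)  # exact cap; the single-partition shuffle now sees only a small set
)
joined = df_a.join(capped, "key")
```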
