Spark 'limit' does not run in parallel?

Problem Description

I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, and indeed at this stage only one task is running in the cluster.

This of course affects performance dramatically (removing the limit removes the single-task bottleneck, but lengthens the join, since it then works on a much larger dataset).
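For reference, a minimal sketch of the setup in Scala, with hypothetical input paths and column names; calling explain() on the joined DataFrame shows the single-partition exchange that precedes the limit:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("limit-demo").getOrCreate()

    // Hypothetical inputs; any two joinable DataFrames reproduce the plan.
    val ordersDf = spark.read.parquet("/data/orders")
    val customersDf = spark.read.parquet("/data/customers")

    // A global limit forces Spark to shuffle the limited side into a single
    // partition (Exchange SinglePartition followed by GlobalLimit in the
    // physical plan), so one task processes all the rows at that stage.
    val limited = ordersDf.limit(1000000)
    val joined = limited.join(customersDf, Seq("customer_id"))

    joined.explain()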

Is limit truly not parallelizable? And if so, is there a workaround?

I am using Spark on a Databricks cluster.

Regarding the possible duplicate: the answer there does not explain why everything is shuffled into a single partition. Also, I asked for advice on working around this issue.

Recommended Answer

Following the advice given by user8371915 in the comments, I used sample instead of limit, and it uncorked the bottleneck.
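Continuing the sketch from the question (same hypothetical DataFrames), the change amounts to swapping the limit call for a sample; sample filters each partition independently, so no single-partition exchange is introduced:

    // Before: limit() funnels the whole side through one partition.
    // val limited = ordersDf.limit(1000000)

    // After: sample() keeps the work distributed; each partition retains
    // roughly the given fraction of its rows, with no shuffle required.
    val sampled = ordersDf.sample(withReplacement = false, fraction = 0.01)
    val joined = sampled.join(customersDf, Seq("customer_id"))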

A small but important detail: I still had to put a predictable size constraint on the result set after sampling, but sample takes a fraction as input, so the size of the result set can vary greatly depending on the size of the input.

Fortunately for me, running the same query with count() was very fast, so I first counted the size of the entire result set and used that to compute the fraction I later passed to sample.
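Sketched out under the same assumptions (the target row count here is made up for illustration):

    // 1. Count the full result set first; in this case it ran quickly.
    val total = ordersDf.count()

    // 2. Compute the fraction expected to yield roughly the desired size.
    val desiredRows = 1000000L // hypothetical target
    val fraction =
      if (total == 0) 0.0 else math.min(1.0, desiredRows.toDouble / total)

    // 3. Sample with the computed fraction: approximately desiredRows rows,
    //    produced in parallel across partitions instead of in one task.
    val sampled = ordersDf.sample(withReplacement = false, fraction = fraction)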
