How to know which count query is the fastest?


Question

I've been exploring query optimizations in the recent releases of Spark SQL 2.3.0-SNAPSHOT and noticed different physical plans for semantically-identical queries.

Let's assume I've got to count the number of rows in the following dataset:

val q = spark.range(1)

I could count the number of rows as follows:

  1. q.count
  2. q.collect.size
  3. q.rdd.count
  4. q.queryExecution.toRdd.count
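
The physical plans behind these variants can be compared directly in the shell. A sketch, assuming a live `SparkSession` named `spark` as in spark-shell:

```scala
// spark-shell sketch; `spark` is the SparkSession provided by the shell.
val q = spark.range(1)

// Plan behind variant 1: Dataset.count executes a global aggregate,
// equivalent to groupBy().count(), with whole-stage codegen.
q.groupBy().count().explain()

// Variants 3 and 4 bypass the Dataset-level operators; their RDD lineage
// shows whether a deserializing step (InternalRow -> Row) is present.
println(q.rdd.toDebugString)                   // includes a deserializing MapPartitionsRDD
println(q.queryExecution.toRdd.toDebugString)  // raw InternalRow RDD, no conversion
```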

My initial thought was that it would be an almost constant-time operation (surely due to the local dataset) that Spark SQL would somehow have optimized to give a result immediately, especially the first one, where Spark SQL is in full control of the query execution.

Having had a look at the physical plans of the queries led me to believe that the most effective query would be the last:

q.queryExecution.toRdd.count

The reasons are:

  1. It avoids deserializing rows from their InternalRow binary format
  2. The query is codegened
  3. There's only one job with a single stage

The physical plan is as simple as that.

Is my reasoning correct? If so, would the answer be different if I read the dataset from an external data source (e.g. files, JDBC, Kafka)?

The main question is what are the factors to take into consideration to say whether a query is more efficient than others (per this example)?

The other execution plans for completeness.

Answer

I did some testing on val q = spark.range(100000000):

  1. q.count: ~50 ms
  2. q.collect.size: I stopped the query after a minute or so...
  3. q.rdd.count: ~1100 ms
  4. q.queryExecution.toRdd.count: ~600 ms

Some explanations:

Option 1 is by far the fastest because it uses both partial aggregation and whole stage code generation. The whole stage code generation allows the JVM to get really clever and do some drastic optimizations (see: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html).
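
The partial-aggregation part can be sketched outside Spark: each partition produces a tiny local count, and only those partial results are combined, so no row data ever moves between partitions. This is a conceptual illustration, not Spark's actual implementation:

```scala
// Conceptual sketch of partial aggregation (not Spark's implementation):
// each "partition" computes a local count (the map-side partial aggregate),
// and only the small partial results are summed in the final aggregate.
val partitions: Seq[Seq[Long]] = Seq(Seq(0L, 1L, 2L), Seq(3L, 4L), Seq(5L))
val partialCounts = partitions.map(_.size.toLong) // partial aggregation per partition
val total = partialCounts.sum                     // final aggregation
```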

Option 2 is just slow and materializes everything on the driver, which is generally a bad idea.

Option 3 is like option 4, but it first converts the internal rows to regular rows, and this is quite expensive.

Option 4 is about as fast as you will get without whole-stage code generation.
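
These timings can be reproduced (expect different absolute numbers on different machines) with `spark.time`, which has been available on `SparkSession` since Spark 2.1:

```scala
// spark-shell sketch; `spark` is the SparkSession provided by the shell.
val q = spark.range(100000000L)

spark.time(q.count)                      // partial aggregation + whole-stage codegen
spark.time(q.rdd.count)                  // converts InternalRow to Row first
spark.time(q.queryExecution.toRdd.count) // counts over the raw InternalRow RDD
// q.collect.size is left out on purpose: it materializes every row on the driver
```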
