How to know which count query is the fastest?


Question

I've been exploring query optimizations in the recent releases of Spark SQL 2.3.0-SNAPSHOT and noticed different physical plans for semantically-identical queries.

Let's assume I've got to count the number of rows in the following dataset:

val q = spark.range(1)

I could count the number of rows as follows:

  1. q.count
  2. q.collect.size
  3. q.rdd.count
  4. q.queryExecution.toRdd.count
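
The physical plans behind these variants can be compared directly in the shell. A sketch, assuming a live `SparkSession` named `spark` as in spark-shell:

```scala
// spark-shell sketch; `spark` is the SparkSession provided by the shell.
val q = spark.range(1)

// Plan behind variant 1: Dataset.count executes a global aggregate,
// equivalent to groupBy().count(), with whole-stage codegen.
q.groupBy().count().explain()

// Variants 3 and 4 bypass the Dataset-level operators; their RDD lineage
// shows whether a deserializing step (InternalRow -> Row) is present.
println(q.rdd.toDebugString)                   // includes a deserializing MapPartitionsRDD
println(q.queryExecution.toRdd.toDebugString)  // raw InternalRow RDD, no conversion
```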

My initial thought was that it would be an almost constant-time operation (surely due to the local dataset) that Spark SQL would somehow have optimized to give a result immediately, especially the first one, where Spark SQL is in full control of the query execution.

Having had a look at the physical plans of the queries led me to believe that the most effective query would be the last:

q.queryExecution.toRdd.count

The reasons are:

  1. It avoids deserializing rows from their InternalRow binary format
  2. The query is codegened
  3. There's only one job with a single stage

The physical plan is as simple as that.

Is my reasoning correct? If so, would the answer be different if I read the dataset from an external data source (e.g. files, JDBC, Kafka)?

The main question is what are the factors to take into consideration to say whether a query is more efficient than others (per this example)?

The other execution plans for completeness.

Answer

I did some testing on val q = spark.range(100000000):

  1. q.count: ~50 ms
  2. q.collect.size: I stopped the query after a minute or so...
  3. q.rdd.count: ~1100 ms
  4. q.queryExecution.toRdd.count: ~600 ms

Some explanations:

Option 1 is by far the fastest because it uses both partial aggregation and whole stage code generation. The whole stage code generation allows the JVM to get really clever and do some drastic optimizations (see: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html).
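
The partial-aggregation part can be sketched outside Spark: each partition produces a tiny local count, and only those partial results are combined, so no row data ever moves between partitions. This is a conceptual illustration, not Spark's actual implementation:

```scala
// Conceptual sketch of partial aggregation (not Spark's implementation):
// each "partition" computes a local count (the map-side partial aggregate),
// and only the small partial results are summed in the final aggregate.
val partitions: Seq[Seq[Long]] = Seq(Seq(0L, 1L, 2L), Seq(3L, 4L), Seq(5L))
val partialCounts = partitions.map(_.size.toLong) // partial aggregation per partition
val total = partialCounts.sum                     // final aggregation
```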

Option 2 is just slow and materializes everything on the driver, which is generally a bad idea.

Option 3 is like option 4, but it first converts the internal rows to regular rows, and this is quite expensive.

Option 4 is about as fast as you will get without whole-stage code generation.
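
These timings can be reproduced (expect different absolute numbers on different machines) with `spark.time`, which has been available on `SparkSession` since Spark 2.1:

```scala
// spark-shell sketch; `spark` is the SparkSession provided by the shell.
val q = spark.range(100000000L)

spark.time(q.count)                      // partial aggregation + whole-stage codegen
spark.time(q.rdd.count)                  // converts InternalRow to Row first
spark.time(q.queryExecution.toRdd.count) // counts over the raw InternalRow RDD
// q.collect.size is left out on purpose: it materializes every row on the driver
```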
