How to know which count query is the fastest?


Question

I've been exploring query optimizations in the recent releases of Spark SQL 2.3.0-SNAPSHOT and noticed different physical plans for semantically-identical queries.

Let's assume I've got to count the number of rows in the following dataset:

val q = spark.range(1)

I can count the number of rows as follows:

  1. q.count
  2. q.collect.size
  3. q.rdd.count
  4. q.queryExecution.toRdd.count

My initial thought was that it's almost a constant operation (surely due to a local dataset) that would somehow have been optimized by Spark SQL and would give a result immediately, esp. the 1st one where Spark SQL is in full control of the query execution.

Having had a look at the physical plans of the queries led me to believe that the most effective query would be the last:

q.queryExecution.toRdd.count

The reasons are:

  1. It avoids deserializing rows from their InternalRow binary format
  2. The query is codegened
  3. There's only one job with a single stage

The physical plan is as simple as it gets.
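
The plans behind each variant can be inspected directly in a spark-shell session. A sketch (it assumes a running SparkSession bound to `spark`, as in the shell; output is elided and the exact operators vary across Spark versions):

```scala
// spark-shell sketch; `spark` is the session the shell provides.
val q = spark.range(1)

// The aggregation plan Spark runs for Dataset.count (equivalent query):
q.groupBy().count().explain()

// The plan of the Dataset itself; q.rdd adds a deserialization step on top of it:
q.explain()

// The low-level RDD[InternalRow] that backs the Dataset, with no deserialization:
val internal = q.queryExecution.toRdd
```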

Is my reasoning correct? If so, would the answer be different if I read the dataset from an external data source (e.g. files, JDBC, Kafka)?

The main question is what are the factors to take into consideration to say whether a query is more efficient than others (per this example)?

The other execution plans for completeness.

Answer

I did some tests on val q = spark.range(100000000):

  1. q.count: ~50 ms
  2. q.collect.size: I stopped the query after a minute or so...
  3. q.rdd.count: ~1100 ms
  4. q.queryExecution.toRdd.count: ~600 ms
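
Timings like these can be reproduced with a small helper (recent Spark versions also ship spark.time for the same purpose). A minimal pure-Scala sketch; the numbers will of course vary with hardware and JIT warm-up:

```scala
// Minimal timing helper: runs a by-name block once and reports elapsed milliseconds.
object Timing {
  def time[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body // forces evaluation of the by-name argument
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

For example, Timing.time(q.count) returns the count together with the elapsed milliseconds; run each query a few times first so the JIT has warmed up before trusting the numbers.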

Some explanations:

Option 1 is by far the fastest because it uses both partial aggregation and whole stage code generation. The whole stage code generation allows the JVM to get really clever and do some drastic optimizations (see: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html).
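
Partial aggregation means each partition first reduces its rows to a single local count, and only those tiny per-partition counts are combined at the end. A pure-Scala sketch of the idea (illustrative only, not Spark's actual implementation):

```scala
// Each "partition" counts locally; only the small partial results are merged.
object PartialCount {
  def count[T](partitions: Seq[Seq[T]]): Long =
    partitions
      .map(_.size.toLong) // map side: one local count per partition
      .sum                // reduce side: combine the tiny partial counts
}
```

No row ever needs to leave its partition; only one Long per partition does, which is why the full aggregation is so cheap.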

Option 2 is just slow: it materializes everything on the driver, which is generally a bad idea.

Option 3 is like option 4, but it first converts each internal row to a regular row, which is quite expensive.
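
The cost difference can be illustrated outside Spark: counting fixed-width binary records needs no decoding, while producing regular objects pays for deserializing every record. A hypothetical analogy (the names and the one-Long-per-record layout are illustrative, not Spark's actual UnsafeRow format):

```scala
import java.nio.ByteBuffer

object RecordCount {
  val RecordSize = 8 // one Long per record, like a single-column range

  // Pack the values into a flat binary buffer, one 8-byte record each.
  def encode(values: Seq[Long]): Array[Byte] = {
    val buf = ByteBuffer.allocate(values.length * RecordSize)
    values.foreach(v => buf.putLong(v))
    buf.array()
  }

  // Roughly what toRdd.count can do: count records without decoding them.
  def countRaw(bytes: Array[Byte]): Long =
    bytes.length / RecordSize

  // Roughly what q.rdd.count pays for: decode every record, then count.
  def countDecoded(bytes: Array[Byte]): Long = {
    val buf = ByteBuffer.wrap(bytes)
    var n = 0L
    while (buf.hasRemaining) { buf.getLong(); n += 1 }
    n
  }
}
```

Both return the same answer, but the decoded version touches every byte and allocates work per record, mirroring the ~1100 ms vs ~600 ms gap above.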

Option 4 is about as fast as you will get without whole-stage code generation.

