Why does a Spark query run faster when it's executed a second time?


Question

The second time I run a query, it is significantly faster. Why?

Code:

public void test3() {
    Dataset<Row> sqlDF = spark.read().json("src/main/resources/data/ipl.json");
    sqlDF.repartition(2); // return value is discarded, so this call has no effect on sqlDF
    Dataset<Row> result1 = sqlDF.where("run > 10000").select("team", "run");
    // Dataset<Row> cachedPartition = result1.cache();
    result1.collect();
    // result1.show();
    log.info("PhysicalPlan\n" + result1.queryExecution().executedPlan());

    Dataset<Row> result2 = sqlDF.where("run > 10000").select("team", "run");
    result2.collect();
    // result1.show();
    log.info("PhysicalPlan\n" + result2.queryExecution().executedPlan());
}

Physical plan:

Execution time in the Spark UI:

Why do these queries take different amounts of time, and why is the difference in execution time so large? Is caching happening under the hood? If so, why is it not mentioned in the physical plan?

Recommended answer

You're pointing Spark at a file. The second time you access the same file, it is read faster.

The same thing happens if you run the following code twice (except that Scala runs on the JVM and uses java.nio and java.io, of course).

with open("src/main/resources/data/ipl.json") as f:
    t = f.read()
print(t)

The first time, the I/O operation has to be initialized. The second time, the I/O operation can reuse parts of the previous run. If the file is small (as it seems to be in your case), the whole file will already have been cached.
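One way to check this yourself is to time two consecutive reads of the same file outside Spark. The sketch below is only an illustration, assuming the speed-up comes from file caching below Spark rather than from Spark itself. The helper name `time_reads` and the temporary JSON file are hypothetical stand-ins, not part of the original code.

```python
import os
import tempfile
import time


def time_reads(path, repeats=2):
    """Read `path` `repeats` times; return a list of (seconds, bytes_read)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()
        timings.append((time.perf_counter() - start, len(data)))
    return timings


if __name__ == "__main__":
    # Hypothetical stand-in for ipl.json. A file that was just written is
    # usually already cached, so point time_reads at a file that has not
    # been read recently if you want to observe a cold first read.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
        tmp.write('{"team": "A", "run": 12000}\n' * 10000)
    try:
        for i, (seconds, size) in enumerate(time_reads(tmp.name), start=1):
            print(f"read {i}: {size} bytes in {seconds * 1000:.3f} ms")
    finally:
        os.remove(tmp.name)
```

Timings vary between runs and machines, so treat the numbers as indicative only. If the second read is consistently faster on a file that had not been touched recently, that matches the explanation above and would also explain why nothing about caching appears in the Spark physical plan.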
