Any performance issues forcing eager evaluation using count in Spark?

Question

I commonly see Dataset.count used throughout codebases in three scenarios:

  1. Logging: log.info(s"this ds has ${dataset.count} rows")
  2. Branching: if (dataset.count > 0) do x else do y
  3. Forcing a cache: dataset.persist.count

In any of these scenarios, does forcing evaluation to be eager prematurely prevent the query optimizer from creating the most efficient DAG?

Recommended answer

TL;DR: 1) and 2) can usually be avoided but shouldn't harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo cult programming practice.
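For scenario 2, one way to reduce the cost of the branch when only "empty or not" matters is to ask for emptiness instead of a full count. The following is a minimal sketch, not part of the original answer: the object name BranchingExample and the helper hasRows are hypothetical, and it assumes Spark 2.4 or later, where Dataset.isEmpty is available. It is still an action, so it does not make the branch lazy; it only avoids counting every row.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object BranchingExample {
  // Hypothetical helper: true if the dataset has at least one row.
  // Dataset.isEmpty exists since Spark 2.4; on older versions,
  // ds.head(1).isEmpty is the usual substitute.
  def hasRows(ds: Dataset[_]): Boolean = !ds.isEmpty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("branching-example")
      .getOrCreate()

    val dataset = spark.range(0, 1000).toDF("id")

    // Branch on emptiness rather than on dataset.count > 0.
    if (hasRows(dataset)) println("do x") else println("do y")

    spark.stop()
  }
}
```

Whether the branch is needed at all is still a pipeline design question, as the answer notes.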

Without cache

Calling count on its own is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control-flow requirements can usually (though not always) be handled with a better pipeline design.
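As a rough illustration of the listener approach, the sketch below sums the records-written metric reported by finished tasks instead of calling count before a write. The class name RecordsWrittenListener, the object name, and the temporary output path are hypothetical choices for this example. It assumes the write goes through a file-based data source that reports output metrics, and it sums every task in the application, so a real job would need to scope the counter to the stages it cares about.

```scala
import java.nio.file.Files
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Hypothetical listener: accumulates records written by all finished tasks.
class RecordsWrittenListener extends SparkListener {
  val recordsWritten = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for tasks that failed before reporting metrics.
    Option(taskEnd.taskMetrics).foreach { metrics =>
      recordsWritten.addAndGet(metrics.outputMetrics.recordsWritten)
    }
  }
}

object ListenerLoggingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("listener-logging-example")
      .getOrCreate()

    val listener = new RecordsWrittenListener
    spark.sparkContext.addSparkListener(listener)

    // Hypothetical output location for this sketch.
    val path = Files.createTempDirectory("listener-logging").resolve("out").toString

    val dataset = spark.range(0, 1000).toDF("id")
    dataset.write.parquet(path) // the only action; no separate count

    // Listener events are delivered asynchronously. In this sketch the
    // counter is read after the session is stopped, by which point the
    // queued events should have been delivered.
    spark.stop()
    println(s"tasks reported ${listener.recordsWritten.get} rows written")
  }
}
```

The row count is a by-product of the write the job performs anyway, so no extra pass over the data is triggered just for logging.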

On its own, count has no impact on the execution plan of the rest of the pipeline: the execution plan for count is normally different from the execution plan of the parent anyway. In general, Spark does as little work as possible, so it removes the parts of the execution plan that are not required to compute the count.
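One way to see this is to compare the physical plan of a dataset with the plan of the aggregation that count runs. The sketch below is an illustration under assumptions: the object name, column names, and temporary Parquet path are made up, and it uses groupBy().count() as the lazy equivalent of the count action so that explain can be called on it. The exact plan text depends on the Spark version and data source.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object CountPlanExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-plan-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input written only so there is a columnar source to read.
    val path = Files.createTempDirectory("count-plan").resolve("data").toString
    spark.range(0, 100)
      .select($"id", ($"id" * 2).as("doubled"), $"id".cast("string").as("label"))
      .write.parquet(path)

    // The parent pipeline reads the columns and derives a new one.
    val parent = spark.read.parquet(path).withColumn("tripled", $"id" * 3)

    // Plan of the parent: projection over a scan of the stored columns.
    parent.explain()

    // Plan of the count: an aggregation over the same source. The derived
    // column is not needed for counting, so the optimizer is expected to
    // drop that projection and prune the columns read from the source.
    parent.groupBy().count().explain()

    spark.stop()
  }
}
```

The two plans are optimized independently, which is why an isolated count costs an extra job but does not change how the parent itself is later planned.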

With cache:

Combining count with cache is a bad practice naively copied from patterns used with the RDD API. It is already disputable with RDDs, but with DataFrame it can break a lot of internal optimizations (selection and predicate pushdown) and, technically speaking, is not even guaranteed to work.
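The effect on pushdown can be observed by comparing plans with and without the cache. The following sketch makes several assumptions: the object name, the helper readsFromCache, the column names, and the temporary Parquet path are all hypothetical, and the plan output varies between Spark versions. It only shows where the filter and column selection end up, not how much time either variant takes.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object CachePushdownExample {
  // Hypothetical helper: true if the optimized plan reads from the cache.
  def readsFromCache(df: DataFrame): Boolean =
    df.queryExecution.optimizedPlan.toString.contains("InMemoryRelation")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-pushdown-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input data for the comparison.
    val path = Files.createTempDirectory("cache-pushdown").resolve("data").toString
    spark.range(0, 100)
      .select($"id", ($"id" * 2).as("doubled"))
      .write.parquet(path)

    // Without cache: the filter and the column selection can be pushed
    // into the Parquet scan (look for PushedFilters and ReadSchema).
    val direct = spark.read.parquet(path).filter($"id" > 90).select("id")
    direct.explain()
    println(s"direct query reads from cache: ${readsFromCache(direct)}")

    // With the persist-then-count pattern: the cached relation is built
    // from the full, unfiltered source, and later queries scan the
    // in-memory data instead of pushing work down to the files.
    val cached = spark.read.parquet(path).cache()
    cached.count()
    val viaCache = cached.filter($"id" > 90).select("id")
    viaCache.explain()
    println(s"cached query reads from cache: ${readsFromCache(viaCache)}")

    cached.unpersist()
    spark.stop()
  }
}
```

The uncached query is explained before anything is cached on purpose: Spark matches cached data by plan, so once the source is cached, an identical read elsewhere in the application would be answered from the cache as well.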
