Any performance issues forcing eager evaluation using count in spark?
Question
Commonly I see Dataset.count throughout codebases in 3 scenarios:
- logging
log.info("this ds has ${dataset.count} rows")
- branching
if (dataset.count > 0) do x else do y
- forcing cache
dataset.persist.count
Does it prevent the query optimizer from creating the most efficient dag by forcing it to be eager prematurely in any of those scenarios?
Answer
TL;DR 1) and 2) can usually be avoided, but shouldn't harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo cult programming practice.
count without cache
Calling count alone is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control flow requirements can usually (not always) be mediated with a better pipeline design.
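One way to get a row count for logging without a separate count job is to tally rows as a side effect of an action that runs anyway, e.g. with a long accumulator. This is a minimal sketch (assuming Spark 2.x/3.x in local mode; the app name and numbers are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("acc-demo").getOrCreate()
val sc = spark.sparkContext

// Count rows as a by-product of the action that runs anyway,
// instead of triggering an extra count job just for logging.
val rows = sc.longAccumulator("rows")
val rdd = sc.parallelize(1 to 1000).map { x => rows.add(1); x * 2 }

val result = rdd.collect()                // the only job that actually runs
println(s"processed ${rows.value} rows")  // logged without a second pass

spark.stop()
```

Note that accumulator values can over-count if tasks are retried, so this suits logging rather than exact semantics; task-level metrics from a SparkListener have the same caveat.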
Alone it won't have any impact on the execution plan (the execution plan for count is normally different from the execution plan of the parent anyway; in general Spark does as little work as possible, so it will remove the parts of the execution plan which are not required to compute the count).
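This pruning can be observed by comparing plans. A hedged sketch (assuming Spark in local mode; the derived column name is illustrative), using groupBy().count() as the DataFrame equivalent of the count action so its plan can be printed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("plan-demo").getOrCreate()

// A parent with a projection that a count does not need.
val df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")

df.explain()                   // full plan, including the 'doubled' projection
df.groupBy().count().explain() // count-style plan; the projection is typically pruned

val n = df.count()

spark.stop()
```

The count plan aggregates over no columns at all, which is why Spark is free to drop projections and other work not needed to produce the number.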
count with cache
count with cache is bad practice, naively copied from patterns used with the RDD API. It is already disputable with RDDs, but with DataFrame it can break a lot of internal optimizations (selection and predicate pushdown) and, technically speaking, is not even guaranteed to work.