Caching ordered Spark DataFrame creates unwanted job
Problem description
I want to convert an RDD to a DataFrame and cache the results of the RDD:
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as fn

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

df = spark.createDataFrame(
    sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4),  # .cache() here caches the RDD
    schema=schema,
    verifySchema=False
).orderBy("t")  # .cache() here caches the ordered DataFrame
- If you don't use a cache function, no job is generated.
- If you use cache only after the orderBy, one job is generated for the cache (see the sketch after this list).
- If you use cache only after the parallelize, no job is generated.
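For reference, the three observations above could be reproduced with variants like the following while watching the Jobs tab of the Spark UI. This is a minimal sketch that reuses the spark, sc, and schema objects from the snippet above; the job counts are the ones reported in this question and may differ across Spark versions.

rows = [Row(t=float(i / 10), value=float(i * i)) for i in range(1000)]

# Variant 1: no cache() at all -- parallelize and orderBy are both lazy,
# so no job appears in the Spark UI.
df_plain = spark.createDataFrame(
    sc.parallelize(rows, 4), schema=schema, verifySchema=False
).orderBy("t")

# Variant 2: cache() after orderBy() -- one job is generated, because the
# global sort needs range-partition bounds (see the answer below).
df_sorted = spark.createDataFrame(
    sc.parallelize(rows, 4), schema=schema, verifySchema=False
).orderBy("t").cache()

# Variant 3: cache() on the RDD only -- RDD caching is lazy and needs no
# partition bounds, so again no job is generated.
cached_rdd = sc.parallelize(rows, 4).cache()
df_rdd = spark.createDataFrame(
    cached_rdd, schema=schema, verifySchema=False
).orderBy("t")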
Why does cache generate a job in this one case? How can I avoid the job generation of cache (caching the DataFrame and not the RDD)?
Edit: I investigated the problem further and found that without the orderBy("t") no job is generated. Why?
Answer
I submitted a bug ticket and it was closed with the following reason:
Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
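In other words, orderBy compiles to a range-partitioned exchange, and Spark must run a sampling job over the input to pick the partition bounds before it can materialize (and therefore cache) the sorted plan. You can see the range partitioning in the physical plan, and, if a global order is not strictly required, a per-partition sort avoids the range exchange entirely. The following is a sketch under that assumption, reusing rows, schema, spark, and sc from the snippets above; whether sortWithinPartitions actually avoids the job in your Spark version is worth verifying in the UI.

# Inspect the physical plan: the global sort inserts an
# Exchange rangepartitioning(...) step, whose bounds must be sampled.
df.explain()

# If only a per-partition order is needed, sortWithinPartitions performs a
# local sort with no range exchange, so cache() has no bounds to compute.
df_local = spark.createDataFrame(
    sc.parallelize(rows, 4),
    schema=schema,
    verifySchema=False
).sortWithinPartitions("t").cache()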