Caching ordered Spark DataFrame creates unwanted job
Problem description
I want to convert an RDD to a DataFrame and cache the results of the RDD:
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as fn

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

df = spark.createDataFrame(
    sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4),  # .cache() here caches the RDD
    schema=schema,
    verifySchema=False
).orderBy("t")  # .cache() here caches the ordered DataFrame
- If you don't use a cache function, no job is generated.
- If you use cache only after the orderBy, one job is generated for the cache (see the sketch after this list).
- If you use cache only after the parallelize, no job is generated.
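For reference, the three observations above could be reproduced with variants like the following while watching the Jobs tab of the Spark UI. This is a minimal sketch that reuses the spark, sc, and schema objects from the snippet above; the job counts are the ones reported in this question and may differ across Spark versions.

rows = [Row(t=float(i / 10), value=float(i * i)) for i in range(1000)]

# Variant 1: no cache() at all -- parallelize and orderBy are both lazy,
# so no job appears in the Spark UI.
df_plain = spark.createDataFrame(
    sc.parallelize(rows, 4), schema=schema, verifySchema=False
).orderBy("t")

# Variant 2: cache() after orderBy() -- one job is generated, because the
# global sort needs range-partition bounds (see the answer below).
df_sorted = spark.createDataFrame(
    sc.parallelize(rows, 4), schema=schema, verifySchema=False
).orderBy("t").cache()

# Variant 3: cache() on the RDD only -- RDD caching is lazy and needs no
# partition bounds, so again no job is generated.
cached_rdd = sc.parallelize(rows, 4).cache()
df_rdd = spark.createDataFrame(
    cached_rdd, schema=schema, verifySchema=False
).orderBy("t")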
Why does cache generate a job in this one case? How can I avoid the job generation of cache (caching the DataFrame and not the RDD)?
Edit: I investigated the problem further and found that without the orderBy("t") no job is generated. Why?
Answer
I submitted a bug ticket and it was closed with the following reason:
Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
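In other words, orderBy compiles to a range-partitioned exchange, and Spark must run a sampling job over the input to pick the partition bounds before it can materialize (and therefore cache) the sorted plan. You can see the range partitioning in the physical plan, and, if a global order is not strictly required, a per-partition sort avoids the range exchange entirely. The following is a sketch under that assumption, reusing rows, schema, spark, and sc from the snippets above; whether sortWithinPartitions actually avoids the job in your Spark version is worth verifying in the UI.

# Inspect the physical plan: the global sort inserts an
# Exchange rangepartitioning(...) step, whose bounds must be sampled.
df.explain()

# If only a per-partition order is needed, sortWithinPartitions performs a
# local sort with no range exchange, so cache() has no bounds to compute.
df_local = spark.createDataFrame(
    sc.parallelize(rows, 4),
    schema=schema,
    verifySchema=False
).sortWithinPartitions("t").cache()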