Is there an idiomatic way to cache Spark dataframes?

Problem description

I have a large parquet dataset that I am reading with Spark. Once it is read, I filter for a subset of rows that is used in a number of functions, each of which applies a different transformation.

The following is similar, though not identical, to the logic I'm trying to accomplish:

from pyspark.sql.functions import col, lit

# `file` and `location` are placeholder paths; `spark` is an existing SparkSession.
df = spark.read.parquet(file)
special_rows = df.filter(col('special') > 0)

# Thinking about adding the following line
special_rows.cache()

def f1(df):
    # First transformation: add a constant column
    new_df_1 = df.withColumn('foo', lit(0))
    return new_df_1

def f2(df):
    # Second transformation: same column, different constant
    new_df_2 = df.withColumn('foo', lit(1))
    return new_df_2

new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)
output_df = new_df_1.union(new_df_2)
output_df.write.parquet(location)

Because a number of functions might use this filtered subset of rows, I'd like to cache or persist it in order to potentially speed up execution and reduce memory consumption. I understand that in the example above, no action is called until my final write to parquet.

My question is: do I need to insert some sort of call to count(), for example, in order to trigger the caching, or will Spark, during that final write-to-parquet call, see that this dataframe is being used in f1 and f2 and cache the dataframe itself?

If yes, is this an idiomatic approach? Does this mean that production, large-scale Spark jobs that rely on caching frequently include otherwise-pointless operations that force an action on the dataframe pre-emptively, such as a call to count()?
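For example, something along these lines, where the count() exists purely to materialize the cache before the two branches are built:

special_rows = df.filter(col('special') > 0).cache()
special_rows.count()   # throwaway action, only to force the cache to fill

new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)
new_df_1.union(new_df_2).write.parquet(location)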

Recommended answer

Both of these statements from the question,

"there is no action called until my final write to parquet"

and

"Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself",

are correct. If you run output_df.explain(), you will see the query plan, which confirms this.
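For instance, a quick way to inspect the plan (assuming the variables from the question's snippet are in scope):

output_df.explain()        # prints the physical plan only
output_df.explain(True)    # prints the parsed, analyzed, optimized and physical plans

Without an explicit cache(), the plan shows the parquet scan and filter feeding both branches of the union; if special_rows.cache() had been called, an InMemoryTableScan / InMemoryRelation node would appear there instead.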

Thus, there is no need to call special_rows.cache(). Generally, cache is only necessary if you intend to reuse the dataframe after forcing Spark to compute something, e.g. after a write or show. If you see yourself intentionally calling count(), you're probably doing something wrong.
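For example (a sketch with hypothetical output paths), caching does pay off when the same filtered dataframe feeds two separate actions, since without it the parquet scan and filter would run once per action:

special_rows = spark.read.parquet(file).filter(col('special') > 0).cache()

special_rows.withColumn('foo', lit(0)).write.parquet(location_1)  # first action materializes the cache
special_rows.withColumn('foo', lit(1)).write.parquet(location_2)  # second action reads from the cache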
